REMARKS 



Claims 1-21 are currently active. 

Antecedent support for the amendments regarding the limitation of "portions of 
the packets as stripes" is found in Claim 13. 

Substitute figures, with an amendment showing the network 12, are enclosed in 
response to the drawing objection. 

The abstract has been amended, per the Examiner's comments, to be within 150 
words. 

The Examiner has objected to the claims for various informalities. Pursuant to 
the Examiner's request, the applicant has amended the claims to obviate these objections. 

The Examiner has rejected Claim 13 under 35 U.S.C. 112, first paragraph. 
The applicant submits that the limitation "changing the number of port cards in the switch" 
refers to physically removing/adding port cards because only then are the port cards "in" the 
switch. One skilled in the art is well versed in how to change the number of port cards in a 
switch. 

The Examiner has rejected Claims 6, 9, 10, 13, 14, 16, 19 and 20 under 35 
U.S.C. 112, second paragraph. 

In regard to Claim 6, it is submitted that Claim 6 is consistent with Claim 3. 
Claim 3 simply identifies the "input lookup". Claim 6 simply states a size limitation regarding 
the input lookup and that it has a 10 bit field. This does not refer to a signal in and of itself, 
but the size of the field inside the input lookup. It is submitted that Claim 6 is clear and 
definite to one skilled in the art. 

In regard to Claim 9, it is now dependent on Claim 7. 

In regard to Claim 10, it is now dependent on Claim 6. 

Claim 13 has been amended so that "from" has been changed to -- by --. 
Claim 13 has also been amended to refer to port cards in line 24 and not fabrics. 

Claim 14 has been amended to make it clear which storing step is referred to. 

Claim 16 is clear and definite for the reasons explained above in regard to 
Claim 6. 

Claim 19 has been amended to refer to the sending portions of packets step so it 
is supported by antecedent basis. 

Claim 20 has been amended to refer more specifically to sending the stripes to 
an aggregator step. 

Accordingly, it is now submitted that the claims rejected under 35 U.S.C. 112, 
second paragraph are clear and definite and this rejection has been obviated. 

The Examiner has rejected Claims 1-3 as being unpatentable over Chiussi in 
view of Parruck and further in view of Newman. In view of the amendments to Claim 1, it is 
submitted that none of these references in any way teach or suggest the limitation of portions 
of packets as stripes that is now found in Claim 1. This is further supported by the fact that 
the Examiner has not rejected Claim 13 which also has such a limitation in view of the applied 
art of record in regard to Claims 1-3. Accordingly, Claims 1-3 are patentable over Chiussi in 
view of Parruck and Newman. 

A substitute clean specification and marked-up original specification are 
enclosed. The marked-up original specification has deletions bracketed and additions underlined. 
No new matter has been added. The information deleted is unnecessary for enablement and is 
considered superfluous information that applicant desires not to have published. 

In view of the foregoing amendments and remarks, it is respectfully requested 
that the outstanding rejections and objections to this application be reconsidered and 
withdrawn, and Claims 1-21, now in this application, be allowed. 



Respectfully submitted, 




Attorney for Applicants 






DYNAMIC QUEUE UTILIZATION 



FIELD OF THE INVENTION 

The present invention is related to the dynamic 
reconfiguration of a switch where the number of port cards changes 
but the number of fabrics stays fixed. More specifically, the 
present invention is related to the dynamic reconfiguration of a 
switch where the number of port cards changes but the number of 
fabrics stays fixed with an input lookup that identifies which 
queue packets should be placed in without having to tear down or 
alter any connection data structure in the packet. 

BACKGROUND OF THE INVENTION 

BFS is a switch of FORE Systems, Warrendale, 
Pennsylvania, with distributed queueing and a variable number of 
output ports but a fixed number of queues. If queue assignments to 
output ports are constant, then the number of queues which can be 
utilized in switch configurations with a smaller number of ports 
is not optimal, since a large amount of hardware resource is 
unused. 

The present invention allows the extra hardware which 
would be waiting for some future expansion in switch capacity to be 
utilized before that capacity is installed. After the switch 
capacity is upgraded, then the hardware can be dynamically 
reallocated to support the new output ports. The process can be 
run in reverse if the switch capacity is downgraded. This 
reconfiguration can be accomplished without tearing down or 
altering any per connection data structure in software and 
guarantees ordered packet/cell delivery. 

SUMMARY OF THE INVENTION 

The present invention pertains to a switch for switching 
packets in a network. The switch comprises port cards which send 
packets to and receive packets from the network. The switch 
comprises fabrics connected to the port cards for switching 
portions of the packets. Each fabric has queues in which portions 
of packets are stored. Each queue corresponds to one of the port 
cards. Each fabric has a determining mechanism which determines 
which queue the portions of the packet should be placed in. The 
detecting mechanism is dynamic to reflect changes in port card 
quantity without any change in connection data of the packets. 

The present invention pertains to a method for switching 
packets in a network. The method comprises the steps of receiving 
packets at port cards of a switch from the network. Then there is 
the step of sending portions of the packets as stripes to a 
respective number of fabrics of the switch. Next there is the step 
of storing the respective portions of packets in queues of the 
fabric corresponding to port cards the portions of the packets are 
to be sent to from the respective fabrics. Then there is the step 
of sending the portions of packets as stripes to the port card. 
Next there is the step of transmitting packets from the port card 
to the network. Then there is the step of changing the number of 
port cards in the switch. Next there is the step of receiving more 
packets at the port cards. Then there is the step of sending 
portions of the more packets to the number of the fabrics after the 
number of the fabrics has changed. Then there is the step of 
storing the portions of the more packets in the queues 
corresponding to the port cards the portions of the packets are to 
be sent to without any change to connection data in the packets. 

BRIEF DESCRIPTION OF THE DRAWINGS 

In the accompanying drawings, the preferred embodiment of 
the invention and preferred methods of practicing the invention are 
illustrated in which: 

Figure 1 is a schematic representation of packet striping 
in the switch of the present invention. 

Figure 2 is a schematic representation of an OC 48 port 
card. 

Figure 3 is a schematic representation of a concatenated 
network blade. 

Figure 4 is a schematic representation regarding the 
connectivity of the fabric ASICs. 

Figure 5 is a schematic representation of a 32 bit cell 
transfer. 

Figure 6 is a schematic representation regarding 
back-pressure. 

Figure 7 is a schematic representation of a 32 bit packet 
transferred using external connection number bus. 

Figure 8 is a schematic representation of a bit cell 
transferred. 

Figure 9 is a schematic representation of a 64 bit packet 
transfer. 

Figure 10 is a schematic representation of ATM cell flow 
in the switch. 

Figure [[11]] 5i is a schematic representation of sync 
pulse distribution. 

Figure 12 is a schematic representation regarding the 
write cycle. 

Figure 13 is a schematic representation regarding the read 
cycle. 

Figure 14 is a schematic representation of the striper 
ASIC architecture. 

Figure 15 is a schematic representation of the aggregator 
ASIC architecture. 

Figure 16 is a schematic representation of a memory 
controller ASIC architecture. 

Figure 17 is a schematic representation of the wide cache 
line shared memory architecture. 

Figure 18 is a schematic representation of a separator 
ASIC architecture. 

Figure 19 is a schematic representation of an unstriper 
ASIC architecture. 

Figure [[20]] ^ is a schematic representation regarding 
the relationship between transmit and receive sequence counters for 
the separator and unstriper, respectively. 

Figure 21 is a schematic representation of a receive 
synchronizer. 

Figure [[22]] 2 is a schematic representation of a switch 
of the present invention. 

DETAILED DESCRIPTION 

Referring now to the drawings wherein like reference 
numerals refer to similar or identical parts throughout the several 
views, and more specifically to figure [[22]] 1_ thereof, there is 
shown a switch 10 for switching packets in a network 12. The 
switch 10 comprises port cards 14 which send packets to and receive 
packets from the network 12. The switch 10 comprises fabrics 16 
connected to the port cards 14 for switching portions of the 
packets. Each fabric 16 has queues 18 in which portions of packets 
are stored. Each queue 18 corresponds to one of the port cards 14. 
Each fabric 16 has a determining mechanism 20 which determines 
which queue 18 the portions of the packet should be placed in. The 
detecting mechanism is dynamic to reflect changes in port card 14 
quantity without any change in connection data of the packets. 

Preferably, each fabric 16 has a memory controller 22 
having the queues 18 and the detecting mechanism. The detecting 
mechanism preferably includes an input lookup 24 which identifies 
in which queue 18 portions of the packet are placed. Preferably, 
the input lookup 24 identifies more queues 18 than are present in 
the switch 10. The fabric 16 preferably receives a first signal 
from the network 12 which identifies which queues 18 correspond to 
which output ports. 



Preferably, the input lookup 24 has a 10-bit field. The 
fabric 16 preferably receives a second signal which identifies 
which bits of the 10-bit field are to be used to identify the queue 
18 the portions of the packet are to be stored in. Preferably, the 
10-bit field comprises bits 0-7 which identify the output port to 
which the queue 18 connects and bits 8 and 9 which identify a priority 
of the portions of the packet. The second signal preferably has a 
2-bit field which indicates which 8 of the 10 bits of the input 
lookup 24 are to be used to identify the queue 18 the portions of 
the packet are to be stored in. Preferably, the 8 bits of the 10 
bits can be either bits 0-5, 8 and 9 which are 4 priorities on up 
to 64 output ports, or bits 0-6 and 8 which are 2 priorities on up to 
128 output ports, or bits 0-7 which are 1 priority on up to 256 output 
ports. 

The fabric 16 preferably has an aggregator 26 which 
receives portions of packets and connects to the memory controller 
22, and a separator 30 which connects to the memory controller 22 
and sends portions of the packets to the port cards 14. 
Preferably, the port card 14 includes a striper 32 which sends 
portions of packets as stripes to the aggregator 26 of each fabric 
16, and an unstriper 34 which receives portions of packets as 
stripes from the separator 30 of each fabric 16. 

The present invention pertains to a method for switching 
packets in a network 12. The method comprises the steps of 
receiving packets at port cards 14 of a switch 10 from the network 
12. Then there is the step of sending portions of the packets as 
stripes to a respective number of fabrics 16 of the switch 10. 
Next there is the step of storing the respective portions of 
packets in queues 18 of the fabric 16 corresponding to port cards 
14 the portions of the packets are to be sent to from the 
respective fabrics 16. Then there is the step of sending the 
portions of packets as stripes to the port card 14. Next there is 
the step of transmitting packets from the port card 14 to the 
network 12. Then there is the step of changing the number of port 
cards 14 in the switch 10. Next there is the step of receiving 
more packets at the port cards 14. Then there is the step of 
sending portions of the more packets to the number of the fabrics 
16 after the number of the fabrics 16 has changed. Then there is 
the step of storing the portions of the more packets in the queues 
18 corresponding to the port cards 14 the portions of the packets 
are to be sent to without any change to connection data in the 
packets. 






Preferably, the storing step includes the step of looking 
up in an input lookup 24, which identifies in which queue 18 
portions of the packets are placed, which queue 18 the portions of 
the packets are to be placed. After the changing step, there is 
preferably the step of receiving a first signal which identifies in 
which queues 18 portions of the packets are to be placed. 
Preferably, after the receiving the first signal step, there is the 
step of receiving a second signal which identifies which bits of a 
10 bit field of the input lookup 24 are to be used to identify the 
queue 18 the portions of the packet are to be stored in. 

The receiving the second signal step preferably includes 
the step of reviewing a 2-bit field of the second signal which 
indicates which 8 of the 10 bits of the input lookup 24 are to be 
used to identify the queue 18 the portions of the packets are to be 
stored in. Preferably, each fabric 16 has a memory controller 22 
having the queues 18 and the sending portions of packets step 
includes the step of sending the stripes to an aggregator 26 of 
each fabric 16 which receives portions of packets and connects to 
the memory controller 22. 

The portions step preferably includes the step of sending 
with a separator 30 of the fabric 16 which connects to the memory 
controller 22 portions of the packets as stripes to the port cards 
14. Preferably, the sending portions step includes the step of 
sending with a striper 32 portions of packets as stripes to the 
aggregator 26 of each fabric 16. After the sending with the 
separator 30 step, there is preferably the step of receiving the 
stripes from the separator 30 of each fabric 16 at an unstriper 34 
of each port card 14. 

In the operation of the invention, the switch 10 has an 
input lookup 24 which identifies a shared memory queue 18 which the 
traffic should be placed in. The queue 18 identifier identifies 
more queues 18 than are present in the switch 10. 



The format of the queue 18 identifier is a 10 bit field: 

Bit 9:8 - identifies the priority of the traffic 
Bit 7:0 - identifies the output port 



At the queue 18 resynch events, each memory controller 22 
receives a 2 bit field which indicates the bits which should be 
used as the 8 bit queue 18 identifier field: 

9:8 + 5:0 - 4 priorities on up to 64 output ports 
8 + 6:0 - 2 priorities on up to 128 output ports 
7:0 - 1 priority on up to 256 output ports 
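
The bit selection listed above can be sketched in Python. This is a minimal illustration under stated assumptions, not the memory controller's actual logic: the mode names ("4x64", "2x128", "1x256") are hypothetical stand-ins for the 2 bit field, whose encoding the text does not give, and placing the selected priority bits above the port bits in the resulting 8 bit value is likewise an assumption.

```python
# Number of output ports addressable in each (hypothetically named) mode.
PORTS_PER_MODE = {"4x64": 64, "2x128": 128, "1x256": 256}


def select_queue_bits(identifier, mode):
    """Return (priority, port) selected from a 10 bit queue identifier."""
    if not 0 <= identifier < (1 << 10):
        raise ValueError("queue identifier must fit in 10 bits")
    priority_field = (identifier >> 8) & 0b11  # bits 9:8
    port_field = identifier & 0xFF             # bits 7:0
    if mode == "4x64":     # bits 9:8 + 5:0
        return priority_field, port_field & 0x3F
    if mode == "2x128":    # bit 8 + 6:0
        return priority_field & 0b01, port_field & 0x7F
    if mode == "1x256":    # bits 7:0 only
        return 0, port_field
    raise ValueError("unknown mode: %r" % (mode,))


def queue_index(identifier, mode):
    """Pack the selected bits into an 8 bit queue index (assumed ordering)."""
    priority, port = select_queue_bits(identifier, mode)
    return priority * PORTS_PER_MODE[mode] + port
```

In every mode the index stays within 0 to 255, which is why the same 10 bit identifier can be kept in the input lookup while the number of port cards changes.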

Queue 18 resynch has two important properties as it 
applies to changing the output queue 18 of incoming traffic. 

A. All old traffic is dequeued before any new traffic 
is dequeued. 

B. The queue resynch event is synchronous. 



Property A ensures that traffic which changes queues 
cannot get reordered. Property B ensures that the switch 10 
fabrics 16 are not thrown out of synch when the changing of the 
queueing is done (i.e., all fabrics 16 will enqueue the same packet 
in the same queue). 

The switch uses RAID techniques to increase overall 
switch bandwidth while minimizing individual fabric bandwidth. In 
the switch architecture, all data is distributed evenly across all 
fabrics so the switch adds bandwidth by adding fabrics and the 
fabric need not increase its bandwidth capacity as the switch 
increases bandwidth capacity. 

Each fabric provides 40G of switching bandwidth and the 
system supports 1, 2, 3, 4, 6, or 12 fabrics, exclusive of the 
redundant/spare fabric. In other words, the switch can be a 40G, 
80G, 120G, 160G, 240G, or 480G switch depending on how many fabrics 
are installed. 

A portcard provides 10G of port bandwidth. For every 4 
portcards, there needs to be 1 fabric. The switch architecture 
does not support arbitrary installations of portcards and fabrics. 
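
The capacity arithmetic in the two paragraphs above can be checked with a few lines of Python. This is only a sketch of the stated figures (40G per fabric, 10G per portcard, 4 portcards per fabric); the function and constant names are illustrative and do not come from the specification.

```python
FABRIC_GBPS = 40                  # switching bandwidth per fabric
PORTCARD_GBPS = 10                # port bandwidth per portcard
PORTCARDS_PER_FABRIC = 4
SUPPORTED_FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)  # spare fabric not counted


def switch_capacity_gbps(fabrics):
    """Switching bandwidth for a supported number of non-spare fabrics."""
    if fabrics not in SUPPORTED_FABRIC_COUNTS:
        raise ValueError("unsupported fabric count: %d" % fabrics)
    return fabrics * FABRIC_GBPS


def max_portcards(fabrics):
    """Portcards that a supported number of fabrics can serve."""
    if fabrics not in SUPPORTED_FABRIC_COUNTS:
        raise ValueError("unsupported fabric count: %d" % fabrics)
    return fabrics * PORTCARDS_PER_FABRIC
```

With 12 fabrics this gives 480G of switching bandwidth for 48 portcards of 10G each, so fabric and port bandwidth match at every supported size.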

The fabric ASICs support both cells and packets. As a 
whole, the switch takes a "receiver make right" approach where the 
egress path on ATM blades must segment frames to cells and the 
egress path on frame blades must perform reassembly of cells into 
packets. 

There are currently eight switch ASICs that are used in 
the switch: 

A. Striper - The Striper resides on the portcard and 
   SCP-IM. It formats the data into a 12 bit data 
   stream, appends a checkword, splits the data stream 
   across the N, non-spare fabrics in the system, 
   generates a parity stripe of width equal to the 
   stripes going to the other fabrics, and sends the 
   N+1 data streams out to the backplane. 

B. Unstriper - The Unstriper is the other portcard 
   ASIC in the switch architecture. It receives 
   data stripes from all the fabrics in the system. It 
   then reconstructs the original data stream using 
   the checkword and parity stripe to perform error 
   detection and correction. 

C. Aggregator - The Aggregator takes the data streams 
   and routewords from the Stripers and multiplexes 
   them into a single input stream to the Memory 
   Controller. 

D. Memory Controller - The Memory Controller 
   implements the queueing and dequeueing mechanisms 
   of the switch. This includes the proprietary wide 
   memory interface to achieve the simultaneous 
   en-/de-queueing of multiple cells of data per clock 
   cycle. The dequeueing side of the Memory Controller 
   runs at 80Gbps compared to 40Gbps in order to make 
   the bulk of the queueing and shaping of connections 
   occur on the portcards. 

E. Separator - The Separator implements the inverse 
   operation of the Aggregator. The data stream from 
   the Memory Controller is demultiplexed into 
   multiple streams of data and forwarded to the 
   appropriate Unstriper ASIC. Included in the 
   interface to the Unstriper is a queue and flow 
   control handshaking. 

F. Trident - Trident is, strictly speaking, not one of 
   the ASICs. It is actually one half of the Poseidon 
   chipset. Trident will be used to implement the ATM 
   portcards within the switch. 

G. Vortex - Vortex is the partner to Trident in the 
   Poseidon chipset. Vortex is the ingress ASIC and 
   Trident the egress device. Together the two chips 
   implement a 2.5Gbps ingress, 5Gbps egress system 
   capable of supporting up to OC-48c ports. 

H. Reassembler - The Reassembler ASIC is the frame 
   blade equivalent to Trident. It will be capable of 
   taking cell streams from the Unstriper and 
   converting them into frames. 

There are 3 different views one can take of the 
connections between the fabric: physical, logical, and "active." 
Physically, the connections between the portcards and the fabrics 
are all gigabit speed differential pair serial links. This is 
strictly an implementation issue to reduce the number of signals 
going over the backplane. The "active" perspective looks at a 
single switch configuration, or it may be thought of as a snapshot 
of how data is being processed at a given moment. The interface 
between the fabric ASIC on the portcards and the fabrics is 
effectively 12 bits wide. Those 12 bits are evenly distributed 
("striped") across 1, 2, 3, 4, 6, or 12 fabrics based on how the 
fabric ASICs are configured. The "active" perspective refers to the 
number of bits being processed by each fabric in the current 
configuration which is exactly 12 divided by the number of fabrics. 
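
As a rough illustration of the "active" view, the following Python sketch splits one 12 bit word evenly across the configured number of fabrics and joins it again. Which bits go to which fabric is not stated in the text; giving the most significant bits to the first fabric is an assumption made here for readability, and the function names are hypothetical.

```python
BUS_WIDTH = 12  # effective width of the portcard-to-fabric interface


def stripe_word(word, fabrics):
    """Split a 12 bit word into `fabrics` equal stripes (first = top bits)."""
    if fabrics <= 0 or BUS_WIDTH % fabrics:
        raise ValueError("fabric count must divide 12")
    width = BUS_WIDTH // fabrics          # bits handled by each fabric
    mask = (1 << width) - 1
    return [(word >> (BUS_WIDTH - width * (i + 1))) & mask
            for i in range(fabrics)]


def unstripe_word(stripes):
    """Reassemble the 12 bit word from its stripes."""
    width = BUS_WIDTH // len(stripes)
    word = 0
    for stripe in stripes:
        word = (word << width) | stripe
    return word
```

With 3 fabrics each stripe is 4 bits wide; with 12 fabrics each fabric handles a single bit of every word.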

The logical perspective can be viewed as the union or max 
function of all the possible active configurations. Fabric slot #1 
can, depending on configuration, be processing 12, 6, 4, 3, 2, or 
1 bits of the data from a single Striper and is therefore drawn 
with a 12 bit bus. In contrast, fabric slot #3 can only be used to 
process 4, 3, 2, or 1 bits from a single Striper and is therefore 
drawn with a 4 bit bus. 

Unlike previous switches, the switch really doesn't have 
a concept of a software controllable fabric redundancy mode. The 
fabric ASICs implement N+1 redundancy without any intervention as 
long as the spare fabric is installed. 

As for what it provides, N+1 redundancy means that 
the hardware will automatically detect and correct a single failure 
without the loss of any data. 

The way the redundancy works is fairly simple, but to 
make it even simpler to understand, a specific case of a 120G switch 
is used which has 3 fabrics (A, B, and C) plus a spare (S). The 
Striper takes the 12 bit bus and first generates a checkword which 
gets appended to the data unit (cell or frame). The data unit and 
checkword are then split into a 4-bit-per-clock-cycle data stripe 
for each of the A, B, and C fabrics (A3A2A1A0, B3B2B1B0, and C3C2C1C0). 
These stripes are then used to produce the stripe for the spare 
fabric S3S2S1S0, where Sn = An XOR Bn XOR Cn, and these 4 stripes are 
sent to their corresponding fabrics. On the other side of the 
fabrics, the Unstriper receives 4 4-bit stripes from A, B, C, and 
S. All possible combinations of 3 fabrics (ABC, ABS, ASC, and SBC) 
are then used to reconstruct a "tentative" 12-bit data stream. A 
checkword is then calculated for each of the 4 tentative streams 
and the calculated checkword compared to the checkword at the end 
of the data unit. If no error occurred in transit, then all 4 
streams will have checkword matches and the ABC stream will be 
forwarded to the Unstriper output. If a (single) error occurred, 
only one checkword match will exist and the stream with the match 
will be forwarded off chip and the Unstriper will identify the 
faulty fabric stripe. 
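
The 3+1 case above can be modeled in a short Python sketch. It is a simplified illustration, not the ASIC design: the checkword algorithm is not named in the text, so a sum of the data words modulo 4096 is used here purely as a hypothetical stand-in, and the function names are invented for the example.

```python
def checkword(words):
    """Hypothetical 12 bit checkword (the source does not name the algorithm)."""
    return sum(words) % 4096


def xor3(p, q, r):
    """Element-wise XOR of three stripe lists."""
    return [x ^ y ^ z for x, y, z in zip(p, q, r)]


def stripe(words):
    """Append the checkword, then build the A, B, C and spare (S) stripes."""
    unit = list(words) + [checkword(words)]
    a = [(w >> 8) & 0xF for w in unit]
    b = [(w >> 4) & 0xF for w in unit]
    c = [w & 0xF for w in unit]
    return a, b, c, xor3(a, b, c)          # Sn = An XOR Bn XOR Cn


def unstripe(a, b, c, s):
    """Return (words, faulty); faulty is None, 'A', 'B', 'C' or 'S'."""
    candidates = {
        "ABC": (a, b, c),
        "ABS": (a, b, xor3(a, b, s)),      # C rebuilt from parity
        "ASC": (a, xor3(a, s, c), c),      # B rebuilt from parity
        "SBC": (xor3(s, b, c), b, c),      # A rebuilt from parity
    }
    matches = {}
    for name, (x, y, z) in candidates.items():
        unit = [(p << 8) | (q << 4) | r for p, q, r in zip(x, y, z)]
        if checkword(unit[:-1]) == unit[-1]:
            matches[name] = unit[:-1]
    if len(matches) == 4:                  # no error in transit
        return matches["ABC"], None
    if len(matches) == 1:                  # single faulty stripe
        name, words = next(iter(matches.items()))
        faulty = ({"A", "B", "C", "S"} - set(name)).pop()
        return words, faulty
    raise ValueError("uncorrectable: more than one stripe is damaged")
```

The stream that still passes the checkword is the one that leaves out the damaged stripe, which is how the faulty fabric is identified in this model.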

For different switch configurations, i.e. 1, 2, 4, 6, or 
12 fabrics, the algorithm is the same but the stripe width changes. 

If 2 fabrics fail, all data running through the switch 
will almost certainly be corrupted. 

There are basically two options, both requiring that the 
defective fabrics be known through some means. Unfortunately, in 
a double failure system, the hardware that detects and identifies 
a failed fabric will only be able to identify the fabric that 
failed first (if there was one). Identifying both the failed 
fabrics may only be possible through a trial and error approach 
unless the switch software and/or switch diagnostics can develop 
tests to identify the second failure. 

The recommended approach would be to shut down the switch 
and install as many good fabrics as possible beginning with slot 1. 
This allows the maximum bandwidth and redundancy to be available given 
the functional hardware available. 

The other option is to have the switch software 
reconfigure the switch to use fewer fabrics. This is an inferior 
solution for two reasons: 

1. It can never provide more bandwidth than the 
   recommended approach. 

2. It requires substantial thought and understanding 
   of the switch by the user in order to determine 
   what is the maximum operational configuration. 

Basically, the user must start at fabric slot 1 and count 
the number of operational fabrics. If the spare fabric is 
operational, then it may be used to "cover" for the first 
non-operational fabric. 

Example #1: A redundant 240G switch (6+1 fabrics) has suffered 
fabric failures in slots 3 and 4. Starting with slot 1 there are 2 
operational fabrics and the spare is available to cover for slot 3. 
This switch can be reconfigured to a 120G non redundant switch or 
an 80G redundant switch. Note that by swapping fabrics 5 and 6 into 
slots 3 and 4, this switch could be a 160G redundant switch. 

Example #2: A redundant 480G switch suffers fabric failures in 
slots 1 and the spare. Start swapping fabrics. Slot 1 is dead and 
the spare is not available to cover for it. This is the worst case 
scenario. 

Example #3: A redundant 480G switch suffers fabric failures in 
slots 2 and 10. There is one functional fabric counting from slot 
1 or 9 if the spare is used to cover for slot 2. This switch can be 
configured either as 40G redundant or 240G non-redundant. Note that 
fabrics 7, 8, and 9 do not help since the only legal configuration 
after 6 fabrics is all 12. 
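
The counting rule used in these examples (count operational fabrics from slot 1, optionally let the spare cover the first failed slot, then round down to a legal fabric count) can be written as a small Python sketch. The function names and the return convention are assumptions made for illustration; only the rule and the legal counts come from the text.

```python
LEGAL_FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)
FABRIC_GBPS = 40


def largest_legal(count):
    """Largest supported fabric count not exceeding `count` (0 if none)."""
    return max((n for n in LEGAL_FABRIC_COUNTS if n <= count), default=0)


def good_run(start, installed, failed):
    """Number of consecutive operational slots beginning at `start`."""
    slot = start
    while slot <= installed and slot not in failed:
        slot += 1
    return slot - start


def options_after_failures(installed, failed, spare_ok):
    """Return (redundant_gbps, non_redundant_gbps) without moving fabrics."""
    failed = set(failed)
    run = good_run(1, installed, failed)
    # Redundant operation keeps the spare in reserve.
    redundant = largest_legal(run) * FABRIC_GBPS if spare_ok else 0
    usable = run
    if spare_ok and run < installed:
        # The spare covers the first failed slot, so counting continues after it.
        usable = run + 1 + good_run(run + 2, installed, failed)
    return redundant, largest_legal(usable) * FABRIC_GBPS
```

Applied to the three examples, the sketch gives 80G redundant or 120G non-redundant, no usable configuration, and 40G redundant or 240G non-redundant, respectively.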

The fabric slots are numbered and must be populated in 
ascending order. Also, the spare fabric is a specific slot so 
populating fabric slots 1, 2, 3, and 4 is different than populating 
fabric slots 1, 2, 3, and the spare. The former is a 160G switch 
without redundancy and the latter is 120G with redundancy. 

Firstly, the ASICs are constructed and the backplane 
connected such that the use of certain portcard slots requires 
there to be at least a certain minimum number of fabrics installed, 
not including the spare. This relationship is shown in Table 0. 

In addition, the APS redundancy within the switch is 
limited to specifically paired portcards. Portcards 1 and 2 are 
paired, 3 and 4 are paired, and so on through portcards 47 and 48. 
This means that if APS redundancy is required, the paired slots 
must be populated together. 

To give a simple example, take a configuration with 2 
portcards and only 1 fabric. If the user does not want to use APS 
redundancy, then the 2 portcards can be installed in any two of 
portcard slots 1 through 4. If APS redundancy is desired, then the 
two portcards must be installed either in slots 1 and 2 or slots 3 
and 4. 
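
The APS pairing rule (slots 1 and 2, 3 and 4, and so on through 47 and 48) reduces to a simple odd/even relationship, sketched below in Python. The function names are illustrative only.

```python
MAX_PORTCARD_SLOT = 48


def aps_partner(slot):
    """Return the portcard slot paired with `slot` for APS redundancy."""
    if not 1 <= slot <= MAX_PORTCARD_SLOT:
        raise ValueError("portcard slot out of range: %d" % slot)
    # Odd slots pair with the next slot, even slots with the previous one.
    return slot + 1 if slot % 2 else slot - 1


def is_aps_pair(slot_a, slot_b):
    """True if the two slots form one of the fixed APS pairs."""
    return aps_partner(slot_a) == slot_b
```

For the two-portcard example above, slots 1 and 2 or slots 3 and 4 qualify, while slots 2 and 3 do not even though they are adjacent.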



Portcard Slot     Minimum # of Fabrics 

1-4               1 
5-8               2 
9-12              3 
13-16             4 
17-24             6 
25-48             12 

Table 0: Fabric Requirements for Portcard Slot Usage 
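
Table 0 can be expressed as a lookup, shown here as a brief Python sketch. The data values are taken from the table; the function names and the idea of checking a whole set of populated slots at once are illustrative assumptions.

```python
# (last portcard slot in range, minimum number of non-spare fabrics)
SLOT_REQUIREMENTS = [(4, 1), (8, 2), (12, 3), (16, 4), (24, 6), (48, 12)]


def min_fabrics_for_slot(slot):
    """Minimum fabrics needed before portcard `slot` can be used."""
    if not 1 <= slot <= 48:
        raise ValueError("portcard slot out of range: %d" % slot)
    for last_slot, fabrics in SLOT_REQUIREMENTS:
        if slot <= last_slot:
            return fabrics


def min_fabrics_for_slots(slots):
    """Minimum fabrics needed for every populated portcard slot to work."""
    return max(min_fabrics_for_slot(slot) for slot in slots)
```

For instance, a single portcard in slot 12 already requires 3 fabrics, and populating any slot from 25 upward requires all 12.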

To add capacity, add the new fabric(s), wait for the 
switch to recognize the change and reconfigure the system to stripe 
across the new number of fabrics. Install the new portcards. 

Note that it is not technically necessary to have the 
full 4 portcards per fabric. The switch will work properly with 3 
fabrics installed and a single portcard in slot 12. This isn't cost 
efficient but it will work. 

To remove capacity, reverse the adding capacity 
procedure. 

The switch is oversubscribed if, for example, 8 portcards are 
installed with only one fabric. 

This should only come about as the result of improperly 
upgrading the switch or a system failure of some sort. The reality 
is that one of two things will occur, depending on how this 
situation arises. If the switch is configured as a 40G switch and 
the portcards are added before the fabric, then the 5th through 8th 
portcards will be dead. If the switch is configured as an 80G 
non-redundant switch and the second fabric fails or is removed then all 
data through the switch will be corrupted (assuming the spare 
fabric is not installed). And just to be complete, if 8 portcards 
were installed in an 80G redundant switch and the second fabric 
failed or was removed, then the switch would continue to operate 
normally with the spare covering for the failed/removed fabric. 

The switch includes the following features: 

Scales from 40Gbps to 480Gbps (40, 80, 120, 160, 240, 480 
Gb/sec are the supported configurations). 

Switches ATM cells and variable-length packets. 

N+1 fabric redundancy with error detection and recovery 
supported in the ASIC chipset. 

Native APS support. 

■* Support up to 19GK cell shared memory, — D21GK unicast and 

G4K multicast connections. 

Support 2x port speed for fabric dequeueing (2.5 Gb/sec 
in, 5 Gb/sec out for each OC48 port). 

Supports both OC48c ports and OC192c ports. 






Provides port/priority queuing similar to past switch 
fabrics. Four priorities are provided for 40 - 120 Gb/sec 
switches, 2 priorities/port for 240 Gb/sec switches and 
1 priority for 480 Gb/sec switches. 

ASICs utilize 250 MHz HSTL point to point busses between 
fabric ASICs and interface with the backplane using 
standard Gbit transceivers. 

Interface — — port — cards — chips — ttse — 00 125 — MH^ — LVTTL 

signals . 

Support output port supplied back pressure. 

One significant architectural difference between the 
switch and past switches is that incoming traffic is routed to 
multiple switch fabrics. Each fabric is designed to enqueue 40 
Gb/sec of data and dequeue 80 Gb/sec of data. As data comes into 
the switch, it is broken up on a bit by bit basis and part of each 
packet is sent to each fabric in the box. The fabrics will all make 
the same enqueuing and drop decisions, and all schedule fragments 
of a packet/cell at the same time. Each fabric sends its portion of 
the packet or cell to the output port card which reassembles the 
fragment into the complete cell/packet which is then passed to a 
shared memory ASIC for per port storage and scheduling. The XOR of 
the data sent to each fabric is sent to a spare fabric. In the 
event of a fabric failure, that fabric's data can be recovered by 
utilizing the good data bits and the parity fabric bits to 
recalculate any fabric's data. The striping of data to fabrics 
happens on the basis of 48 bit chunks. This allows the switch to 
support 1, 2, 3, 4, 6 and 12 fabrics. 

Five ASICs build the switching functionality for the 
switch. These ASICs are described briefly below. 

TABLE 1: The switch ASICs 



ASIC          Function 

Striper       Takes incoming cell from Vortex (or OC192c equivalent) 
              or from POS input stage and breaks the data up into the 
              appropriate chunks to go to each fabric, calculates the 
              parity for the spare fabric, concatenates a checksum 
              onto the packet, separates the routeword and data into 
              separate routeword and data busses which run across the 
              backplane. 

Aggregator    Receives separate data and routeword busses from 
              multiple stripers. Converts from the reasonably slim 
              dedicated striper->Aggregator busses to a wide shared 
              bus to the memory controllers. 

Memory        Actually perform the queueing of data for the fabrics. 
Controllers   Queues the cell into one of 200 queues (192 UC queues, 
              4 MC queues and 4 control port queues). All drops which 
              occur in the chipset occur here. 

Separator     Combines traffic from multiple memory controllers to 
              one fabric output. Provides rate control of the stream 
              of data leaving the fabric for each OC48 or OC192c port. 

Unstriper     Receives data from multiple separators. Combines 
              traffic and error checks the received data. Detects 
              errors on any fabric and attempts to reconstruct the 
              good data. Passes the data to the output memory 
              controller. If the striper is on an ATM blade and the 
              data is a packet, it is segmented before passing onto 
              the ATM controller. 



Figure 1 shows packet striping in the switch. 



The chipset supports ATM and POS port cards in both OC48
and OC192c configurations. OC48 port cards interface to the
switching fabrics with four separate OC48 flows. OC192 port cards
logically combine the 4 channels into a 10G stream. The ingress
side of a port card does not perform traffic conversions for
traffic changing between ATM cells and packets. Whichever form of
traffic is received is sent to the switch fabrics. The switch
fabrics will mix packets and cells and then dequeue a mix of
packets and cells to the egress side of a port card.

The egress side of the port is responsible for converting
the traffic to the appropriate format for the output port. This
convention is referred to in the context of the switch as "receiver
makes right". A cell blade is responsible for segmentation of
packets and a cell blade is responsible for reassembly of cells
into packets. To support fabric speed-up, the egress side of the
port card supports a link bandwidth equal to twice the inbound side
of the port card. For each OC48 interface, the unstriper supports
a bandwidth of 6 Gb/sec and for each OC192 interface, a bandwidth of
24 Gb/sec (combined routeword + data).

The block diagram for a Poseidon-based ATM port card is
shown as in Figure 2. Each 2.5G channel consists of 4 ASICs: Vortex
Inbound TM and striper ASIC at the inbound side and unstriper ASIC
and Trident outbound TM ASIC at the outbound side.

At the inbound side, the Vortex ASIC aggregates 1 OC-48c
or 4 OC-12c interfaces. Each Vortex sends a 2.5G
cell stream into a dedicated striper ASIC (using the BIB bus, as
described below). The striper converts the Vortex supplied
routeword into two pieces. A portion of the routeword is passed to
the fabric to determine the output port(s) for the cell. The
entire routeword is also passed on the data portion of the bus as
a routeword for use by the outbound memory controller. The first
routeword is termed the "fabric routeword". The routeword for the
outbound memory controller is the "egress routeword".

At the outbound side, the unstriper ASIC in each channel
takes traffic from each of the port cards, error checks and corrects
the data and then sends correct packets out on its output bus. The
unstriper uses the data from the spare fabric and the checksum
inserted by the striper to detect and correct data corruption.
2.5Gbps traffic is then sent to the Trident ASIC of the Poseidon
chipset. The Trident ASIC stores the incoming cells based on per VC
queues and sends them over to OC-12c/OC-48c interfaces at
aggregated speed of 2.5Gbps.

For the POS interfaces, the striper ASIC input bus speeds
up to 3.2Gbps to handle POS overhead. On the outbound side, the
unstriper talks to a reassembly stage which is currently being
defined.

Figure 2 shows an OC48 Port Card. 

The OC192 port card supports a single 10G stream to the
fabric and between a 10G and 20G egress stream. This board also
uses 4 stripers and 4 unstripers, but the 4 chips operate in
parallel on a wider data bus. The data sent to each fabric is
identical for both OC48 and OC192 ports so data can flow between
the port types without needing special conversion functions.

Figure 3 shows a 10G concatenated network blade.

Each 40G switch fabric enqueues up to 40Gbps cells/frames
and dequeues them at 80Gbps. This 2X speed-up reduces the amount of
traffic buffered at the fabric and lets the outbound ASIC digest
bursts of traffic well above line rate. A switch fabric consists of
three kinds of ASICs: aggregators, memory controllers, and
separators. Nine aggregator ASICs receive 40Gbps of traffic from
up to 48 network blades and the control port. The aggregator ASICs
combine the fabric routeword and payload into a single data stream
and TDM between its sources and places the resulting data on a wide
output bus. An additional control bus (destid) is used to control
how the memory controllers enqueue the data. The data stream from
each aggregator ASIC is then bit sliced into 12 memory controllers.



The memory controller receives up to 16 cells/frames
every 250 MHz clock cycle. Each of 12 ASICs stores 1/12 of the
aggregated data streams. It then stores the incoming data based
on control information received on the destid bus. Storage of data
is simplified in the memory controller to be relatively unaware of
packet boundaries (cache line concept). All 12 ASICs dequeue the
stored cells simultaneously at aggregated speed of 80Gbps.



Nine separator ASICs perform the reverse function of the
aggregator ASICs. Each separator receives data from all 12 memory
controllers and decodes the routewords embedded in the data streams
by the aggregator to find packet boundaries. Each separator ASIC
then sends the data to up to 24 different unstripers depending on
the exact destination indicated by the memory controller as data
was being passed to the separator.



The dequeue process is back-pressure driven. If
back-pressure is applied to the unstriper, that back-pressure is
communicated back to the separator. The separator and memory
controllers also have a back-pressure mechanism which controls when
a memory controller can dequeue traffic to an output port.

In order to support OC48 and OC192 efficiently in the
chipset, the 4 OC48 ports from one port card are always routed to
the same aggregator and from the same separator (the port
connections for the aggregator and separator are always symmetric).
The table below shows the port connections for the aggregator and
separator on each fabric for the switch configurations. Since each
aggregator is accepting traffic from 10G of ports, the addition of
40G of switch capacity only adds ports to 4 aggregators. This leads
to a differing port connection pattern for the first four
aggregators from the second 4 (and also the corresponding
separators).



TABLE 2: Agg/Sep port connections

Switch Size  Agg 1        Agg 2        Agg 3        Agg 4        Agg 5        Agg 6        Agg 7        Agg 8

40           1,2,3,4      5,6,7,8      9,10,11,12   13,14,15,16

80           1,2,3,4      5,6,7,8      9,10,11,12   13,14,15,16  17,18,19,20  21,22,23,24  25,26,27,28  29,30,31,32

120          1,2,3,4,     5,6,7,8,     9,10,11,12,  13,14,15,16, 17,18,19,20  21,22,23,24  25,26,27,28  29,30,31,32
             33,34,35,36  37,38,39,40  41,42,43,44  45,46,47,48

160          1,2,3,4,     5,6,7,8,     9,10,11,12,  13,14,15,16, 17,18,19,20, 21,22,23,24, 25,26,27,28, 29,30,31,32,
             33,34,35,36  37,38,39,40  41,42,43,44  45,46,47,48  49,50,51,52  53,54,55,56  57,58,59,60  61,62,63,64
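
The pattern in Table 2 can be captured by a small sketch. The formula
below is inferred from the table rows only and is not stated in the
text; the function names and the assumption of 16 ports per 40G of
switch capacity are illustrative.

```python
PORTS_PER_CARD = 4      # the 4 OC48 ports of one port card share an aggregator
AGGREGATORS = 8         # aggregator columns shown in Table 2
PORTS_PER_40G = 16      # assumption read off the "40" row of Table 2


def aggregator_for_port(port):
    """Return the 1-based aggregator number serving a 1-based port number."""
    if port < 1:
        raise ValueError("ports are numbered from 1")
    return ((port - 1) // PORTS_PER_CARD) % AGGREGATORS + 1


def ports_for_aggregator(agg, switch_size_g):
    """List the ports attached to one aggregator for a given switch size (40, 80, 120, 160)."""
    total_ports = switch_size_g // 40 * PORTS_PER_40G
    return [p for p in range(1, total_ports + 1) if aggregator_for_port(p) == agg]


row_120_agg1 = ports_for_aggregator(1, 120)
row_160_agg5 = ports_for_aggregator(5, 160)
```

Under this reading, growing the switch from 80G to 120G adds ports only
to aggregators 1 through 4, which matches the differing connection
pattern noted for the first four aggregators.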



Figure 4 shows the connectivity of the fabric ASICs. 



The external interfaces of the switches are the Input Bus
(BIB) between the striper ASIC and the ingress blade ASIC such as
Vortex and the Output Bus (BOB) between the unstriper ASIC and the
egress blade ASIC such as Trident.



Two variations of routewords are supported. The first
option uses one 32 bit routeword which is passed to the egress
board as the egress routeword and has fields extracted to form the
fabric routeword. The second option allows the striper to accept
both a fabric routeword (which happens on a dedicated routeword
bus) and an egress routeword (which is received on the data bus).
The second option is more flexible on connection space usage and
expansion since that allows all 32 bits of the routeword to be used
to identify connections on switch egress.

To maintain compatibility with Vortex, bit 24 is still
maintained as the multicast bit. The incoming routeword has the
following format.



TABLE 3: 32 bit BIB/BOB routeword format

bit 30:25                  bit 24          bit 23:0

Connection ID (29:28) &    Multicast Bit   Connection ID (27:20) &
Connection ID (19:16)                      Connection ID (15:0)



The 20 bit conn ID in the routeword is set to

MC bit & Connection ID (29:5) for UC connections which are
not special routeword values

MC bit & Connection ID (24:0) for MC connections or for
special routeword unicast values.

For UC connections, although bits 2D:D are passed to
the fabric, only bits 29:20 are used. These bits should be
programmed with queue to be used. Bits 29:20 should be programmed
with the priority and bits 27:20 programmed with the queue
number.

Note that the RW value used for the outbound memory
controller is set to
"0" & MC bit & connection ID (29:0).

If the fabric is using 10 bits of conn ID, this leaves
20 bits (1 M connections) for use by the outbound memory
controller.

For double routewords, no manipulation is done. The
value passed in on the routeword bus needs to be equal to the
connection ID to be transmitted on the backplane. The following
two tables show the routeword value which should be passed on the
backplane routeword bus.



TABLE 4: Unicast Connection ID for separate RW bus



u : ^ -yr 


bit 24:23 


Bit 22:15 












Multicast bit-Q 


Tabric priority 


Fabric queue ID 


Future expansion bits. This bits arc 


transmitted to the fabric, but the cur rent 


Fabric ignores them. Future fabrics may 
expand to support these bits. 



TABLE 5 : Multicast Connection ID for separate RW bus 



umTj 



bi t 24 : 23 



bi t 22:16 



bi t 15 : 0 



Multicast bit - 



Pr i o r ity queue ID 



Reserved. Note these b i ts mult i cast connection ID (0 to 



are se nt to the fab r ic to 641C) used by the fabric 
allow — futu r e — fab ri cs — to 



s uppo r t mo r e connection 

3paccT 



Special routewords are flagged by using reserved queue
numbers (those in the range of 240 - 255). These routeword values
indicate the receipt of an OAM cell which must get routed to the
control port or a queue resynch operation. These special values
are always expressed in terms of the connection ID which goes to
the fabric. If special routewords are given to the fabric, the
memory controller routeword must also be modified if these are
getting passed in using the separate connection number bus.

The routeword passed to the fabric will contain the
multicast bit and the port mask bits (bits 23:16). The routeword
passed to the outbound memory controller will maintain the port
mask and also contain the Vortex ID and the port ID.

The connection ID of an OAM cell has a special format
generated by the Vortex ASIC:



TABLE 6: Connection ID for OAM cell



Bi t 24:23 



bi t 22 : 15 



bi t 14:9 



bi t 7:0 



Multicast bit - 0 



Vo r tex \D(1 ^ 



OxrO (hex) 



Vo r tex ID (5 :9) 



' cscrvcd 



Port ID 



The Vortex ID field is used to indicate which source
Vortex ASIC the cell comes from. The port ID indicates which port
the cell comes from inside the Vortex ASIC. Note that OAM cells are
all unicast. All OAM cells are destined to one of 196 blade and
control port queues programmed by an 8-bit OAM cell destination
register in the memory controller ASICs. If separate routeword
busses are being used, bit 24:16 of the BIB_CONN field will be
passed to the fabric. The routeword which appears on the data bus
(memory controller routeword) should include the port mask, Vortex
ID and port ID fields in bits 23:0. The value in the multicast bit
is a don't care for the memory controller routeword.



Fabric queue ID 0xF0 - 0xF7 of the unicast connection ID is
reserved for software use. All packets which have the fabric queue
ID in range of 0xF0 - 0xFF will be redirected to one of the 4 control
port queues based on a programmable register.

The connection ID of a resync cell has the following
format. The resync cell is used to resynchronize queues in the
memory controller ASICs. Fabric queue ID 0xF0 - 0xFF of the unicast
connection ID is reserved for special fabric functions.



TABLE 7: Connection ID for Resync cell



bi t 22:15 



bi t 14:13 



bi t 12 : 0 



Mult i cast b i t - Q 



Pr i or i ty (u n used) 



OxFr (hex) 



^ ^ l u ni b c r of Rese r ved 



priorit i es pe r port 



The number of priority queues per port can only be
changed during the queue resync period, i.e., when a fabric is
removed or inserted as follows:

00: one priority per port for 400G switch, pick bit 15
down to 8 of the connection ID as the queue ID
01: two priorities per port for 240G switch, pick bit 16
down to 9 of the connection ID as the queue ID
10: 4 priorities per port for 120G or smaller switch,
pick bit 17 down to 10 of the connection ID as the queue ID
11: reserved

The resync cell can also be used to copy the shadow data
register to a valid location where the shadow address register
points to.

Shadow control cell is used to copy the shadow data
register to a valid location where the shadow address register
points to. The connection ID of a shadow control cell use.

TABLE 8: Connection ID for Shadow Control Cell



Multicast b tt-^ P ri o r ity 



bi t 22 : 15 



bit 14:0 



OxrE(hcx) 



Rese r ved 



Data coming into the BIB bus and out of the BOB bus is
assumed to be filled onto the busses from most significant bit to
least significant bit (highest number bit to lowest number bit).

The Striper ASIC accepts data from the ingress port via
the Input Bus (BIB) (also known as DIN_ST_bl_ch bus).

This bus can either operate as 4 separate 32 bit input
buses (4xOC48c) or a single 128 bit wide data bus with a common set
of control lines to all stripers. This bus supports either cells or
packets based on software configuration of the striper chip. It
consists of the following signals:

BIB_Clock: This clock is sourced by the Striper ASIC at
up to 100 MHz and is used as a reference for data and
control signals on the BIB.

BIB_BP: This signal is asserted (low) to indicate the
striper ASIC cannot take data on the bus due to a
bandwidth difference between the BIB and SIB busses.
Interfaces which run below 93 MHz will never see this
signal asserted. At 100 MHz, this signal is asserted if
more than 05530 bytes of back-to-back data are given.
This signal should be sampled at the start of packet.
During a packet transfer, this signal will be asserted if
the FIFO conditions would cause BP if the packet ended on
the current clock cycle. If BP is asserted the clock
cycle after the EOP, the striper will effectively ignore
the input bus until the BP indication is withdrawn. The
packet ingress stage should repeat the first word of the
next packet transfer and then proceed with the rest of
the packet after the BP signal goes away.

BIB_Valid_L: This active low input signal delimits valid
data on the BIB_SOP, BIB_EOP, and BIB_DATA busses. If
this signal is active the busses are assumed to be
valid. If high, the busses are treated as having invalid
data for the current clock cycle. If a transfer is not in
progress (no SOP without EOP has been given) then the
data bus is treated as invalid even if this signal is a
one. For cell interfaces, this signal can be tied active.

BIB_Cell_Pkt: This signal is set to a one to indicate a
cell transfer and a zero to indicate a packet transfer.
Signal needs to be valid the same clock cycle as start of
cell.

BIB_Data[127:0]: This is the input 128-bit data bus. If
running in 32 bit mode, a cell consists of a 4 byte RW,
a 4 byte Header, and twelve 4 byte data words. A packet
has a RW and N data words, where 1 < N. If running in 128
bit mode, a cell has a 4 byte RW, a 4 byte header, and 8
bytes of data in the first word, 2 words with 16 bytes of
data, and a final word with 8 bytes of data, if the data
starts on a word boundary. A following cell can start on
the half word boundary and have all fields offset by 8
bytes. Packets in 128 bit mode work in the same fashion
as 32 bit mode, except that EOP and SOP can have larger
values. Minimum packet length supported is 16 bytes. If
half word boundary cell starts are used, the correct
value (0/4) needs to be given on the SOP bits 3:0.

BIB_EOP[4:0]: This bus has two fields. Bit 4 is a one to
indicate an EOP on the current transfer (if BIB_Valid_L
is active). Bit 4 is a zero to indicate no EOP on the
current transfer. Bits 3:0 give the offset of the last
byte which is valid. The EOP field is not utilized for
cell transfers.



BIB_SOP/C[1:0]: This bit indicates a start of packet or
cell on the current bus cycle (if BIB_Valid_L is active).
A value of zero indicates start of transfer, a value of
one indicates no start of transfer. Asserting bit 1
indicates that the upper 64 bits carries the SOP and
asserting bit 0 indicates that the lower 64 bits
carries the SOP (for 128 bit bus only). For the 32 bit
bus, SOP(0) should be used, SOP(1) should be tied high.
For the 128 bit bus, if a packet ends in the upper 64
bits of the bus, a new packet can begin at bit 63.

BIB_CONN(24:0): This is an optional bus. It can be used
to pass a routeword to the striper ASIC to use as the
fabric routeword, or the routeword can be transferred as
the most significant 32 bits of the first word of data.
The data should be valid the same cycle as SOP/C. The
value during non SOP/C cycles is a don't care. The
interface is statically configured to either use the
separate connection number bus or to expect the routeword
on the data bus.

Figure 5 shows a 32 bit BIB cell transfer.

Figure 6 shows a BIB back pressure.

Figure 7 shows a 32 bit BIB packet transfer using
external connection number bus.

The unstriper ASIC sends data to the egress port via
Output Bus (BOB) (also known as DOUT_UN_bl_ch bus), which is a 64
(or 256) bit data bus that can support either cell or packet. It
consists of the following signals:

This bus can either operate as 4 separate 32 bit output
buses (4xOC48c) or a single 128 bit wide data bus with a common set
of control lines from all Unstripers. This bus supports either
cells or packets based on software configuration of the unstriper
chip. It consists of the following signals:

BOB_Clock: This clock is sourced from the unstriper ASIC
at up to 100 MHz and is used as a reference for data and
control signals on the BOB.



BOB_BP: This active low input signal indicates whether
data can be transferred (inactive) or cannot be
transferred (active). When back-pressure is asserted,
the unstriper will stop advancing the output bus and
signal data is not valid using the BOB_Valid signal.
Since synchronization must be done on both sides of the
interfaces, 8 clock cycles of data must be allowed from
the assertion of BP to data stopping. The source driving
BOB_BP cannot make any assumptions on the data stopping
or restarting except by examining BOB_Valid.

BOB_Valid_L: This active low output signal indicates
whether the bus has valid data or not during a transfer.
This signal indicates invalid data only when BOB_BP has
been asserted.



BOB_Data: This is the output bit data bus. It can either
be 64 bits wide or 256 bits wide. If running in 64 bit
mode, a cell consists of a word with a 4 byte RW and a 4
byte Header followed by 6 data words. A packet has a RW
and N data words, where 1 < N. If running in 256 bit mode
and a cell starts on an even 32 byte word boundary, a
cell has a word with a 4 byte RW, a 4 byte header and 24
bytes of data in the first word, and a second word with
24 bytes of data. A following cell can start on the next
used byte and have all fields offset by 8 bytes. Valid
cell start locations are all multiples of 8 (0, 8, 16,
24). Packets in 128 bit mode work in the same fashion as
32 bit mode, except that EOP and SOP can have larger
values. Minimum packet length supported is 16 bytes. If
half word boundary cell starts are used, the correct
value (0/4) needs to be given on the SOP bits 3:0.

BOB_EOP: This bit is asserted when the last transfer of
a packet is occurring.

BOB_Cell_Pkt: This signal is set to a one to indicate a
cell transfer and a zero to indicate a packet transfer.
Signal needs to be valid the same clock cycle as start of
cell.

BOB_SOP/C: This bit indicates a start of
packet or cell on the current bus cycle. Data is always
assumed to start at the most significant bit of the bus.

Figure 8 shows a 64 bit BOB cell transfer.

Figure 9 shows a 64 bit BOB packet transfer.

Figure 10 shows an overview of the datapath of the switch
ASICs.

The data on the data bus transports an optional byte
count (32 bit word, lower 16 bits are the byte count) and a 32 bit
egress routeword. The unstriper core will always produce a byte
count. If a segmentation engine is used to break the packet up
into cells, then the segmentation engine will drop the byte count
word before it is given to the cell interface. This dropping is
only supported in OC48 mode. In OC192 mode, the chipset will have
no provisions for segmentation and dropping the byte count word.



TABLE 9: OC48 BOB format

OC48 Bits   OC192 bits   Label        Usage

63:48       255:240      Unused       reserved for unstriper use

47:32       239:224      Byte count   Gives the count of the number of bytes in the
                                      packet not counting the 4 bytes for the egress
                                      routeword and the bytes for the byte count
                                      (basically, this corresponds to the byte count
                                      of the received packet plus/minus any changes
                                      for reencapsulation, pushes, or pops.)

            223:192      Egress RW    Routeword for the egress memory controller

                                      Next bits start the data (bits 191 to 0 for
                                      OC192, next clock cycle for OC48)
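
As an illustration of the OC48 layout in Table 9, the sketch below
unpacks the first 64-bit word of a packet transfer. The positions of
the reserved field (63:48) and the byte count (47:32) come from the
table; placing the egress routeword in bits 31:0 is an assumption,
since the table does not give its OC48 bit range. The function name
and dictionary keys are hypothetical.

```python
def parse_bob_first_word_oc48(word):
    """Split the first 64-bit BOB word of a packet into its fields."""
    if not 0 <= word < (1 << 64):
        raise ValueError("expected a 64-bit unsigned value")
    return {
        "reserved": (word >> 48) & 0xFFFF,    # bits 63:48, unstriper use
        "byte_count": (word >> 32) & 0xFFFF,  # bits 47:32
        "egress_rw": word & 0xFFFFFFFF,       # bits 31:0 (assumed position)
    }


# Hypothetical word: 64-byte packet, egress routeword 0x12345678.
fields = parse_bob_first_word_oc48((64 << 32) | 0x12345678)
```

A segmentation engine of the kind described above would discard the
upper 32 bits (reserved field plus byte count) before handing the
remainder to a cell interface.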



The Synchronizer has two main purposes. The first
purpose is to maintain logical cell/packet or datagram ordering
across all fabrics. On the fabric ingress interface, datagrams
arriving at more than one fabric from one port card's channels
need to be processed in the same order across all fabrics. The
Synchronizer's second purpose is to have a port card's egress
channel re-assemble all segments or stripes of a datagram that
belong together even though the datagram segments are being sent
from more than one fabric and can arrive at the blade's egress
inputs at different times. This mechanism needs to be maintained in
a system that will have different net delays and varying amounts of
clock drift between blades and fabrics.



The switch uses a system of synchronized windows where
start information is transmitted around the system. Each transmitter
and receiver can look at relative clock counts from the last
resynch indication to synchronize data from multiple sources. The
receiver will delay the receipt of data which is the first clock
cycle of data in a synch period until a programmable delay after it
receives the global synch indication. At this point, all data is
considered to have been received simultaneously and fixed ordering
is applied. Even though the delays for packet 0 and cell 0 caused
them to be seen at the receivers in different orders due to delays
through the box, the resulting ordering of both streams at receive
time = 1 is the same: Packet 0, Cell 0, based on the physical bus
from which they were received.
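
The fixed ordering rule can be sketched as follows. This is a
simplified model, not the ASIC logic: it assumes each datagram carries
the clock count (relative to the last synch indication) at which it was
sent and the index of the physical bus it arrived on, and that ties
within one count are broken by bus index. The names Arrival and
fixed_order are hypothetical.

```python
from typing import NamedTuple


class Arrival(NamedTuple):
    tick: int   # clock count since the last synch indication
    bus: int    # index of the physical bus the datagram arrived on
    name: str   # label used only for this illustration


def fixed_order(arrivals):
    """Order datagrams by window tick first, then by physical bus."""
    return sorted(arrivals, key=lambda a: (a.tick, a.bus))


# Two receivers see the same datagrams in different arrival orders
# because of differing delays through the box.
seen_by_rx_a = [Arrival(1, 0, "packet 0"), Arrival(1, 1, "cell 0")]
seen_by_rx_b = [Arrival(1, 1, "cell 0"), Arrival(1, 0, "packet 0")]

order_a = [a.name for a in fixed_order(seen_by_rx_a)]
order_b = [a.name for a in fixed_order(seen_by_rx_b)]
```

Both receivers end up with the same sequence because the ordering
depends only on the count and the bus, never on the actual arrival
time.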

Multiple cells or packets can be sent in one counter
tick. All destinations will order all cells from the first
interface before moving onto the second interface and so on. This
cell synchronization technique is used on all cell interfaces.
Differing resolutions are required on some interfaces.

The Synchronizer consists of two main blocks, namely, the
transmitter and receiver. The transmitter block will reside in the
Striper and Separator ASICs and the receiver block will reside in
the Aggregator and Unstriper ASICs. The receiver in the Aggregator
will handle up to 24 (6 port cards x 4 channels) input lanes. The
receiver in the Unstriper will handle up to 13 (12 fabrics + 1
parity fabric) input lanes.

When a sync pulse is received, the transmitter first
calculates the number of clock cycles it is fast (denoted as N
clocks).

The transmit synchronizer will interrupt the output
stream and transmit N K characters indicating it is locking down.
At the end of the lockdown sequence, the transmitter transmits a K
character indicating that valid data will start on the next clock
cycle. This next cycle valid indication is used by the receivers
to synchronize traffic from all sources. Refer to "K character
usage" on page 34 for the mapping of K characters to the functions.

At the next end of transfer, the transmitter will then
insert at least one idle on the interface. These idles allow the
10 bit decoders to correctly resynchronize to the 10 bit serial
code window if they fall out of synch.

The receive synchronizer receives the global synch pulse
and delays the synch pulse by a programmed number (which is
programmed based on the maximum amount of transport delay a
physical box can have). After delaying the synch pulse, the
receiver will then consider the clock cycle immediately after the
synch character to be eligible to be received. Data is then
received every clock cycle until the next synch character is seen
on the input stream. This data is not considered to be eligible
for receipt until the delayed global synch pulse is seen.

Since transmitters and receivers will be on different
physical boards and clocked by different oscillators, clock speed
differences will exist between them. To bound the number of clock
cycles between different transmitters and receivers, a global sync
pulse is used at the system level to resynchronize all sequence
counters. Each chip is programmed to ensure that under all valid
clock skews, each transmitter and receiver will think that it is
fast by at least one clock cycle. Each chip then waits for the
appropriate number of clock cycles they are into their current
sync_pulse_window. This ensures that all sources run
sync_pulse_window valid clock cycles between synch pulses.



As an example, the synch pulse window could be programmed
to 100 clocks, and the synch pulses sent out at a nominal rate of
a synch pulse every 10,000 clocks. Based on worst case drifts
for both the synch pulse transmitter clocks and the synch pulse
receiver clocks, there may actually be 9,995 to 10,005 clocks at
the receiver for 10,000 clocks on the synch pulse transmitter. In
this case, the synch pulse transmitter would be programmed to send
out synch pulses every 10,006 clock cycles. The 10,006 clocks
guarantees that all receivers must be in their next window. A
receiver with a fast clock may have actually seen 10,012 clocks if
the synch pulse transmitter has a slow clock. Since the synch
pulse was received 12 clock cycles into the synch pulse window, the
chip would delay for 12 clock cycles. Another receiver could have
seen 10,005 clocks and lock down for 6 clock cycles at the end of
the synch pulse window. In both cases, each source ran 10,100 clock
cycles.
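
Part of the arithmetic in this example can be sketched in a few lines.
The sketch is a simplified reading, not the chip's algorithm: it
assumes the transmitter period is set one clock beyond the most clocks
any receiver could count in a nominal interval, and that a chip locks
down for as many clocks as it finds itself into its window. The
function names are hypothetical, and only the 10,006 period and the
12 cycle lockdown from the text are reproduced.

```python
def transmitter_period(max_receiver_clocks):
    """Synch pulse period that puts every receiver at least one clock into its next window."""
    return max_receiver_clocks + 1


def lockdown_cycles(clocks_seen, nominal_clocks):
    """Clocks a chip must lock down, i.e. how far it is into its synch pulse window."""
    fast_by = clocks_seen - nominal_clocks
    if fast_by < 1:
        raise ValueError("every chip must appear fast by at least one clock")
    return fast_by


period = transmitter_period(10_005)            # worst case receiver count from the text
fast_receiver = lockdown_cycles(10_012, 10_000)
```

Subtracting the lockdown from the clocks seen leaves the same number of
valid clocks at every source, which is the property the global sync
pulse is meant to guarantee.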



When a port card or fabric is not present or has just
been inserted and either of them is supposed to be driving the
inputs of a receive synchronizer, the writing of data to the
particular input FIFO will be inhibited since the input clock will
not be present or unstable and the status of the data lines will be
unknown. When the port card or fabric is inserted, software must
come in and enable the input to the byte lane to allow data from
that source to be enabled. Writes to the input FIFO will be
enabled. It is assumed that the enable signal will be asserted
after the data, routeword and clock from the port card or fabric
are stable.

At a system level, there will be a primary and secondary
sync pulse transmitter residing on two separate fabrics. There
will also be a sync pulse receiver on each fabric and blade. This
can be seen in Figure [[11]] 5. A primary sync pulse transmitter
will be a free-running sync pulse generator and a secondary sync
pulse transmitter will synchronize its sync pulse to the primary.
The sync pulse receivers will receive both primary and secondary
sync pulses and based on an error checking algorithm, will select
the correct sync pulse to forward on to the ASICs residing on that
board. The sync pulse receiver will guarantee that a sync pulse is
only forwarded to the rest of the board if the sync pulse from the
sync pulse transmitters falls within its own sequence "0" count.
For example, the sync pulse receiver and an Unstriper ASIC will
both reside on the same Blade. The sync pulse receiver and the
receive synchronizer in the Unstriper will be clocked from the same
crystal oscillator, so no clock drift should be present between the
clocks used to increment the internal sequence counters. The
receive synchronizer will require that the sync pulse it receives
will always reside in the "0" count window.

If the sync pulse receiver determines that the primary
sync pulse transmitter is out of sync, it will switch over to the
secondary sync pulse transmitter source. The secondary sync pulse
transmitter will also determine that the primary sync pulse
transmitter is out of sync and will start generating its own sync
pulse independently of the primary sync pulse transmitter. This is
the secondary sync pulse transmitter's primary mode of operation.
If the sync pulse receiver determines that the primary sync pulse
transmitter is in sync once again, it will switch to the
primary side. The secondary sync pulse transmitter will also
determine that the primary sync pulse transmitter is in sync
once again and will switch back to a secondary mode. In the
secondary mode, it will sync up its own sync pulse to the primary
sync pulse. The sync pulse receiver will have less tolerance in
its sync pulse filtering mechanism than the secondary sync pulse
transmitter. The sync pulse receiver will therefore switch over
more quickly than the secondary sync pulse transmitter. This is
done to ensure that all receiver synchronizers will have switched
over to using the secondary sync pulse transmitter source before
the secondary sync pulse transmitter switches over to a primary
mode.
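
The switchover ordering described above can be illustrated with a
short sketch. It assumes that each observer counts consecutive bad
primary sync pulses and gives up on the primary once its tolerance
is exceeded. The class name, the miss-counting rule and the
tolerance values 2 and 4 are hypothetical; the text only requires
that the receiver be less tolerant than the secondary transmitter.

```python
class PrimaryMonitor:
    """Tracks whether the primary sync pulse is considered in sync.

    `tolerance` is the number of consecutive bad pulses tolerated
    before the primary is declared out of sync (hypothetical rule).
    """

    def __init__(self, tolerance):
        self.tolerance = tolerance
        self.bad_in_a_row = 0

    def observe(self, pulse_ok):
        # A good pulse clears the count; a bad pulse increments it.
        self.bad_in_a_row = 0 if pulse_ok else self.bad_in_a_row + 1
        return self.bad_in_a_row <= self.tolerance


def first_switch_index(monitor, pulses):
    """Index of the first pulse at which the monitor abandons the primary."""
    for i, ok in enumerate(pulses):
        if not monitor.observe(ok):
            return i
    return None


receiver = PrimaryMonitor(tolerance=2)    # less tolerant, switches first
secondary = PrimaryMonitor(tolerance=4)   # more tolerant, switches later
pulses = [True, True] + [False] * 8       # primary fails after two good pulses
receiver_switch = first_switch_index(receiver, pulses)
secondary_switch = first_switch_index(secondary, pulses)
```

With these assumed tolerances the receiver abandons the primary two
pulses before the secondary transmitter changes mode, which is the
ordering the text asks for.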

Figure [[11]] 5 shows sync pulse distribution.



In order to lock down the backplane transmission from a
fabric by the number of clock cycles indicated in the sync
calculation, the entire fabric must effectively freeze for that many
clock cycles to ensure that the same enqueuing and dequeuing
decisions stay in sync. This requires support in each of the
fabric ASICs. Lockdown stops all functionality, including special
functions like queue resynch.

The sync signal from the sync pulse receiver is
distributed to all ASICs. Each fabric ASIC contains a counter in
the core clock domain that counts clock cycles between global sync
pulses. After the sync pulse is received, each ASIC calculates the
number of clock cycles it is fast. (5). Because the global sync is
not transferred with its own clock, the calculated lockdown cycle
value may not be the same for all ASICs on the same fabric. This
difference is accounted for by keeping all interface FIFOs at a
depth where they can tolerate the maximum skew of lockdown counts.

Lockdown cycles on all chips are always inserted at the
same logical point relative to the beginning of the last sequence
of "useful" (non-lockdown) cycles. That is, every chip will always
execute the same number of "useful" cycles between lockdown events,
even though the number of lockdown cycles varies.
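
A minimal sketch of this bookkeeping follows. It assumes each ASIC
compares the core clock cycles it counted in one global sync period
against a fixed number of useful cycles, and freezes for the
difference. The constant `USEFUL_CYCLES`, its value, and the function
names are hypothetical.

```python
USEFUL_CYCLES = 1000  # hypothetical number of useful cycles per sync period


def lockdown_cycles(counted_cycles, useful_cycles=USEFUL_CYCLES):
    """Cycles the chip must freeze: how far its count ran ahead."""
    if counted_cycles < useful_cycles:
        raise ValueError("sync period shorter than the useful cycle count")
    return counted_cycles - useful_cycles


def schedule(counted_cycles, useful_cycles=USEFUL_CYCLES):
    """Return (useful, lockdown) cycle counts for one sync period."""
    return useful_cycles, lockdown_cycles(counted_cycles, useful_cycles)


# Two chips on one fabric observe slightly different counts
# because the global sync is not transferred with its own clock.
chip_a = schedule(1003)
chip_b = schedule(1005)
```

Both chips execute the same number of useful cycles; only the
lockdown count differs, and that difference is what the interface
FIFOs have to absorb.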

Lockdown may occur at different times on different chips.
All fabric input FIFOs are initially set up such that lockdown can
occur on either side of the FIFO first without the FIFO running dry
or overflowing. On each chip-chip interface, there is a sync FIFO
to account for lockdown cycles (as well as board trace lengths and
clock skews). The transmitter signals lockdown while it is locked
down. The receiver does not push during indicated cycles, and does
not pop during its own lockdown. The FIFO depth will vary
depending on which chip locks down first, but the variation is
bounded by the maximum number of lockdown cycles. The number of
lockdown cycles a particular chip sees during one global sync
period may vary, but they will all have the same number of useful
cycles. The total number of lockdown cycles each chip on a
particular fabric sees will be the same, within a bounded
tolerance.

The Aggregator core clock domain completely stops for the
lockdown duration - all flops and memory hold their state. Input
FIFOs are allowed to build up. Lockdown bus cycles are inserted in
the output queues. Exactly when the core lockdown is executed is
dictated by when DOUT_AG bus protocol allows lockdown cycles to be
inserted. DOUT_AG lockdown cycles are indicated on the DestID bus.

The memory controller must lock down all flops for the
appropriate number of cycles. To reduce impact to the silicon area
in the memory controller, a technique called propagated lockdown is
used.

The aggregator signals lockdown cycles on the DIN_ME bus.
The memory controller does not push during these cycles. The
memory controller does not pop during lockdown to account for the
non-push cycles. The FIFO depth is set during fabric
synchronization to tolerate getting deeper or shallower depending
on who locks down first.
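
The push and pop rules can be checked with a small simulation. This
is an illustrative sketch only: the function name, the lockdown
windows, the lockdown length of 3 cycles and the starting depth of 4
are all assumed values, not figures from the design.

```python
from collections import deque


def simulate(cycles, tx_lockdown, rx_lockdown, initial_depth):
    """Simulate a sync FIFO; return the depth after each cycle.

    tx_lockdown / rx_lockdown are sets of cycle indices during which
    the transmitter / receiver is locked down.
    """
    fifo = deque(range(initial_depth))  # pre-filled to the starting depth
    next_word = initial_depth
    depths = []
    for cycle in range(cycles):
        if cycle not in tx_lockdown:
            # Transmitter is sending real data, so the receiver pushes it.
            fifo.append(next_word)
            next_word += 1
        if cycle not in rx_lockdown:
            # Receiver core is running, so it pops one word.
            fifo.popleft()
        depths.append(len(fifo))
    return depths


LOCKDOWN = 3
# Transmitter locks down first (cycles 5-7), receiver later (10-12).
tx_first = simulate(20, set(range(5, 8)), set(range(10, 13)), initial_depth=4)
# Receiver locks down first instead.
rx_first = simulate(20, set(range(10, 13)), set(range(5, 8)), initial_depth=4)
```

In both orderings the depth returns to its starting value, and it
never moves by more than the lockdown length in either direction,
which is why the starting depth has to leave room on both sides.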

Lockdown idle cycles are inserted on the DOUT and CH_ID
busses. An extended sync signal is used to indicate the number of
lockdown cycles on the DOUT_ME bus to aid the Separator's lockdown
function.

The token bus lockdown looks the same as the DIN_ME bus
from a memory controller perspective. Non-push cycles are signaled
by the separators according to their lockdowns. The memory
controller does not pop during lockdown. The Separator locks down
completely in a manner similar to the Aggregator. DIN_SF and CH_ID
lockdown cycles are signaled individually per bus via the SYNC
signals. Any continuous SYNC assertion after the first one is
considered a lockdown cycle. Lockdown bus cycles are not pushed
into the input FIFOs.

The chip-to-chip communication within a single fabric
must be synchronized. Although no clock drift exists between
chips, differences in track delays cause data to arrive at
different Memory Controllers at different times. All Memory
Controllers need to process incoming packets in exactly the same
logical order on each chip. The Separators must align and combine
multiple data slices coming from different Memory Controllers. All
Memory Controllers must take the tokens received from the
Separators and apply them at exactly the same point in the logical
packet flow, or drop decisions may differ from chip to chip.



The on-fabric chip-to-chip synchronization is executed at
every sync pulse. While some sync error detecting capability may
exist in some of the ASICs, it is the Unstriper's job to detect
fabric synchronization errors and to remove the offending fabric.
The chip-to-chip synchronization is a cascaded function that is
done before any packet flow is enabled on the fabric. The
synchronization flows from the Aggregator to the Memory Controller,
to the Separator, and back to the Memory Controller. After the
system reset, the Aggregators wait for the first global sync
signal. When received, each Aggregator transmits a local sync
command (value 0x2) on the DestID bus to each Memory Controller.



The Memory Controllers do not push anything into a DIN
input FIFO until the first sync command is seen on that bus. The
sync and every bus cycle following is constantly pushed into the
input FIFO. On the core side of the input FIFOs, no FIFO is popped
until a sync appears in the FIFO from every Aggregator. After two
additional margin cycles, every input FIFO is popped every cycle.
After this point the input FIFO depths remain constant. The depths
are roughly a function of the track delays from each Aggregator.
Immediately after the Memory Controllers begin sampling the
Aggregator input FIFOs, a sync signal (S_SYNC_L) is transmitted to
all Separators on the DOUT and CH_ID busses.
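
The input FIFO alignment can be sketched as below. This is a
simplified model under stated assumptions: each Aggregator is a list
of per-cycle bus values of equal length, the first pop happens two
cycles after the last FIFO has seen its sync, and the names `align`,
`near` and `far` are hypothetical.

```python
from collections import deque

SYNC = "SYNC"
MARGIN_CYCLES = 2  # the two additional margin cycles mentioned in the text


def align(streams):
    """Return (popped rows, FIFO depths after each pop cycle)."""
    fifos = [deque() for _ in streams]
    seen_sync = [False] * len(streams)
    all_sync_cycle = None
    popped, depths = [], []
    for cycle in range(len(streams[0])):
        for i, stream in enumerate(streams):
            word = stream[cycle]
            if word == SYNC:
                seen_sync[i] = True
            if seen_sync[i]:
                # The sync and everything after it is pushed.
                fifos[i].append(word)
        if all_sync_cycle is None and all(seen_sync):
            all_sync_cycle = cycle
        if all_sync_cycle is not None and cycle >= all_sync_cycle + MARGIN_CYCLES:
            # Every input FIFO is popped every cycle from here on.
            popped.append([f.popleft() for f in fifos])
            depths.append([len(f) for f in fifos])
    return popped, depths


near = [SYNC, "a0", "a1", "a2", "a3", "a4", "a5"]   # short track delay
far = ["x", "x", SYNC, "b0", "b1", "b2", "b3"]      # two cycles later
popped, depths = align([near, far])
```

The popped rows line up word for word, and the two depths stay
fixed at values that differ by the two-cycle track delay.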

Like the Memory Controllers, the Separators do not push
into the DIN and CH_ID busses until a sync signal is received on
that bus. The sync and everything after is constantly pushed into
the input FIFO.

On the core side the Separator always waits until at
least one word is present on all input busses, and then pops the
CH_ID and DIN busses simultaneously. This will logically align the
data stripes coming from the Memory Controllers. After the first
combined sync is popped from the input FIFOs, the Separators send
a sync signal on the TOKEN bus to the Memory Controllers.

The Memory Controllers do not push into the TOKEN bus
input FIFO until a sync signal (0x3F on the token bus) has been
seen on the bus. The sync and all subsequent tokens and idles are
always pushed.

All Memory Controllers need to apply the received tokens
to the same point in the incoming logical flow in order for all
drop decisions to be identical. This is done by waiting a worst
case number of clock cycles after the Separator sync transmission
before beginning to pop the token input FIFO. The worst case delay
must be used because there is no way for a single Memory Controller
to know exactly when all other Memory Controllers have received a
token. The programmable delay stored in the 16-bit Token Sync Wait
Register is in "useful" cycles (125 MHz) that do not include the
fabric lockdown cycles. The worst case delay is the worst case
skew for all data paths going from the Aggregator to Memory
Controller to Separator and back to Memory Controller.

The following Table 10 gives the min/max delays which the
chipset supports and represents the limits of what is verified in
the chip verification process.

Sync pulse transport delay from Transmitter to any
individual chip receiving the sync pulse (WC path - BC path):
[illegible] nS (min delay of 0, max delay of [illegible] nS). At
170 ps/inch, this works out to a difference of about 70 m.
Backplane transport delay difference from local sync pulse receipt
to reception of the sync indication flag by the end chips:
[illegible] nS. Note that it is desired to allot about
[illegible] nS of this to the chip synchronizer operation, which
gives a delta path delay supported of 500 nS.

Oscillators should be [illegible] ppm oscillators. An
assumption of the design was that the difference in transmission
path delay was less than or equal to clock drift. On board delays
between chips have been designed to exceed the following specs:

Shortest net: 0.25", transport delay of pretty much 0.
Longest net: 25", transport delay is 5 nS.

For any signal distribution: The net delta delay between chips is
a multiplier of the number of busses the sync has traversed. Once
the sync goes through a receive synchronization to the local clock
of the chip, an 8 nS uncertainty has to be added at each stage,
giving a net uncertainty of around 21 nS for each hop.



TABLE 10: Fabric sync delay






Chip 

Memory 
controller 

Sep DIN 



memo r y 
cont r o l ler 
token in 



Number ofSkcw 
busses 

4" i2 \ ' nS 



UJ lliJ 



Notes 

Syne pulse in 

Gyne pulse to agg i agg_mc del t a 



Syne p ulse to agg i agg_me i me^sep 
(note this sync pulse is delayed by the 
memory — cont r oller — for propagated 
l oekdown). 

eve r ything above i se p ^me tokens. 



The control port follows the same cell flow as the
regular ports. The switch control processor sends cells to the
striper ASIC; the striper stripes the cells and route words across
all fabrics. An additional aggregator (9th) ASIC sends cells via
the DOUT_AG/DestID buses to all 12 memory controllers. Each memory
controller ASIC has an additional 9th DIN_ME_fb_se_9 bus.

The memory controller ASIC will route the incoming
control port cells to any one of the control port destination
queues and blade queues (up to 19G queues). The 9th DOUT_ME_fb_se_9
bus is used to send the control cells to the 9th separator ASIC,
which sends the cells to one of several destination unstriper
ASICs. The unstriper ASIC reconstructs the cells from all 9th
separator ASICs across all fabrics. It sends the complete control
cells to the switch control processor it is connected to.

Note that the control port destination queues can be part
of any multicast cells such that the multicast port mask is
necessary to include additional bit(s) to indicate the control port
queue(s).

There are at most 4 control ports in any switch
configurations. This limitation is due to the aggregator and
separator ASICs only having 4 12-bit channels which can be scalable
to different switch configurations respectively. In other words,
bus DIN_AG_fb_9_1_1, DIN_AG_fb_9_2_1, DIN_AG_fb_9_3_1, and
DIN_AG_fb_9_4_1 of the aggregator ASIC are connected to up to 4
control port striper ASICs. Bus DOUT_SF_fb_9_1_1, DOUT_SF_fb_9_2_1,
DOUT_SF_fb_9_3_1, and DOUT_SF_fb_9_4_1 of the separator ASIC are
connected to up to 4 control port unstriper ASICs.

The striping function assigns bits from incoming data 
streams to individual fabrics. Two items were optimized in deriving 
the striping assignment: 

1. Backplane efficiency should be optimized for OC48 
20 and OC192. 

2. Backplane interconnection should not be 
significantly altered for OC192 operation. 



These were traded off against additional muxing legs for
the striper and unstriper ASICs. Regardless of the optimization,
the switch must have the same data format in the memory controller
for both OC48 and OC192.

Backplane efficiency requires that minimal padding be
added when forming the backplane busses. Given the 12 bit backplane
bus for OC48 and the 48 bit backplane bus for OC192, an optimal
assignment requires that the number of unused bits for a transfer
be equal to (number_of_bytes * 8) / bus_width, where / is integer
division. For OC48, the bus can have 0, 4 or 8 unutilized bits. For
OC192 the bus can have 0, 8, 16, 24, 32, or 40 unutilized bits.

This means that no bit can shift between 12 bit
boundaries or else OC48 padding will not be optimal for certain
packet lengths.
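
The listed padding values can be reproduced with a short
calculation. This sketch assumes the unused bits are whatever is
needed to round the packet length in bits up to a whole number of
bus transfers, which is computed here with a remainder; the function
name `unused_bits` is hypothetical and this reading of the expression
is an assumption.

```python
def unused_bits(number_of_bytes, bus_width):
    """Padding bits needed to fill the final bus transfer of a packet."""
    return (-number_of_bytes * 8) % bus_width


# Collect every padding value that occurs over a range of packet lengths.
oc48_padding = sorted({unused_bits(n, 12) for n in range(1, 200)})
oc192_padding = sorted({unused_bits(n, 48) for n in range(1, 200)})
```

Under that assumption the 12 bit bus only ever sees 0, 4 or 8 unused
bits and the 48 bit bus only multiples of 8 up to 40, matching the
values stated above.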

For OC192c, maximum bandwidth utilization means that each
striper must receive the same number of bits (which implies bit
interleaving into the stripers). When combined with the same
backplane interconnection, this implies that in OC192c, each stripe
must have exactly the correct number of bits come from each
striper, each of which has 1/4 of the bits.

For the purpose of assigning data bits to fabrics, a 48
bit frame is used. Inside the striper is a FIFO which is written 32
bits wide at 80-100 MHz and read 24 bits wide at 125 MHz. Three 32
bit words will yield four 24 bit words. Each pair of 24 bit words
is treated as a 48 bit frame. The assignment between bits and
fabrics depends on the number of fabrics.
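
The width conversion can be sketched as follows. The bit ordering
(most significant bits of the first 32 bit word come out first) is an
assumption, as are the function names; the text only fixes the 3-to-4
word ratio and the pairing into 48 bit frames.

```python
def repack_32_to_24(words32):
    """Repack 32 bit words into 24 bit words, MSB first (assumed order)."""
    if len(words32) % 3:
        raise ValueError("need a multiple of three 32 bit words")
    out = []
    for i in range(0, len(words32), 3):
        # Concatenate three 32 bit words into one 96 bit value.
        block = 0
        for word in words32[i:i + 3]:
            block = (block << 32) | (word & 0xFFFFFFFF)
        # Slice the 96 bits into four 24 bit words.
        for shift in (72, 48, 24, 0):
            out.append((block >> shift) & 0xFFFFFF)
    return out


def frames_48(words24):
    """Pair consecutive 24 bit words into 48 bit frames."""
    return [(words24[i] << 24) | words24[i + 1]
            for i in range(0, len(words24), 2)]


words24 = repack_32_to_24([0x11223344, 0x55667788, 0x99AABBCC])
frames = frames_48(words24)
```

Reading 24 bits at 125 MHz moves 3.0 Gbit/s, the same rate as writing
32 bits at 93.75 MHz, which falls inside the stated 80-100 MHz write
range.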
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TABLE 11: Bit striping function 
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The following tables give the byte lanes which are read 
first in the aggregator and written to first in the separator. The
four channels are notated A,B,C,D. The different fabrics have
different read/write order of the channels to allow for all busses
to be fully utilized.



One fabric-40G 



The next table gives the interface read order for the
aggregator.

Interfaces to the gigabit transceivers will utilize the
transceiver bus as a split bus with two separate routeword and data
busses. The routeword bus will be a fixed size (2 bits for OC48
ingress, 4 bits for OC48 egress, 8 bits for OC192 ingress and 16
bits for OC192 egress); the data bus is a variable sized bus. The
transmit order will always have routeword bits at fixed locations.
Every striping configuration has one transceiver that is used to
talk to a destination in all valid configurations. That
transceiver will be used to send both routeword busses and to start
sending the data.

The backplane interface is physically implemented using
125 MHz interfaces to the backplane transceivers. The 125 MHz bus
for both ingress and egress is viewed as being composed of two
halves, each with routeword data. The two bus halves may have
information on separate packets if the first bus half ends a
packet.

For example, an OC48 interface going to the fabrics
locally speaking has 24 data bits and 2 routeword bits @125 MHz.
This bus will be utilized acting as if it has 2x (12 bit data bus
+ 1 bit routeword bus). The two bus halves are referred to as A
and B. Bus A is the first data, followed by bus B. A packet can
start on either bus A or B and end on either bus A or B.

In mapping data bits and routeword bits to transceiver
bits, the bus bits are interleaved. This ensures that all
transceivers should have the same valid/invalid status, even if the
striping amount changes. Routewords should be interpreted with bus
A appearing before bus B.

The bus A/Bus B concept closely corresponds to having 9r5^ 
Mfe interfaces between chips. 

All backplane busses support fragmentation of data. The
protocol used marks the last transfer (via the final segment bit in
the routeword). All transfers which are not final segment need to
utilize the entire bus width, even if that is not an even number of
bytes. Any given packet must be striped to the same number of
fabrics for all transfers of that packet. If the striping amount
is updated in the striper during transmission of a packet, it will
only update the striping at the beginning of the next packet.
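
The deferred update rule can be sketched with a pending value that is
latched only at a packet boundary. The class and method names are
hypothetical, and the model ignores everything about a transfer
except whether it starts a packet.

```python
class Striper:
    """Holds a pending striping amount and applies it only between packets."""

    def __init__(self, num_fabrics):
        self.active = num_fabrics    # used for the packet in flight
        self.pending = num_fabrics   # most recently requested value

    def set_striping(self, num_fabrics):
        # An update may arrive at any time; it is only recorded here.
        self.pending = num_fabrics

    def send_transfer(self, start_of_packet):
        # The pending value takes effect at the start of a packet.
        if start_of_packet:
            self.active = self.pending
        return self.active


striper = Striper(num_fabrics=1)
used = [striper.send_transfer(start_of_packet=True)]
striper.set_striping(2)  # update arrives in the middle of the packet
used.append(striper.send_transfer(start_of_packet=False))
used.append(striper.send_transfer(start_of_packet=True))  # next packet
```

Every transfer of the first packet is striped to one fabric even
though the update arrived mid-packet; the new amount applies from the
next packet onward.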

Each transmitter on the ASICs will have the following I/O
for each channel:

8 bit data bus, 1 bit clock, 1 bit control.

On the receive side, for each channel the ASIC receives

a receive clock, 8 bit data bus, 3 bit status bus.

The switch optimizes the transceivers by mapping a
transmitter to between 1 and 3 backplane pairs and each receiver
with between 1 and 3 backplane pairs. This allows only enough
transmitters to support traffic needed in a configuration to be
populated on the board while maintaining a complete set of
backplane nets. The motivation for this optimization was to reduce
the number of transceivers needed.

The optimization was done while still requiring that at
any time, two different striping amounts must be supported in the
gigabit transceivers. This allows traffic to be enqueued from a
striper striping data to one fabric and a striper striping data to
two fabrics at the same time.

In all modes of operation, the entire 3.0G of data is
always supported on switch ingress. For egress operation, for
[illegible], the number of transceivers needed to support a full
speedup was deemed too expensive. For these switch modes, the
Output Speedup is between 1.5 and 2. All configurations above
[illegible] support a full 2x speedup.



Depending on the bus configuration, multiple channels may
need to be concatenated together to form one larger bandwidth pipe
(any time there is more than one transceiver in a logical
connection). Although quad gbit transceivers can tie 4 channels
together, this functionality is not used. Instead the receiving
ASIC is responsible for synchronizing between the channels from one
source. This is done in the same context as the generic
synchronization algorithm.



The 8b/10b encoding/decoding in the gigabit transceivers
allows a number of control events to be sent over the channel. The
notation for these control events is K characters and they are
numbered based on the encoded 10 bit value. Several of these K
characters are used in the chipset. The K characters used and
their functions are given in the table below.



TABLE 12: K Character usage

K character  Function         Notes

28.0         Sync indication  Transmitted after lockdown cycles, treated
                              as the prime synchronization event at the
                              receivers
28.1         Lockdown         Transmitted during lockdown cycles on the
                              backplane
28.2         Packet Abort     Transmitted to indicate the card is unable
                              to finish the current packet. Current use
                              is limited to a port card being pulled
                              while transmitting traffic
28.3         Resync window    Transmitted by the striper at the start of
                              a synch window if a resynch will be
                              contained in the current sync window
28.4         BP set           Transmitted by the striper if the bus is
                              currently idle and the value of the bp bit
                              must be set.
28.5         Idle             Indicates idle condition
28.6         BP clr           Transmitted by the striper if the bus is
                              currently idle and the bp bit must be
                              cleared.
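
As an illustration, the K character assignments can be written as an
enumeration. The member names and the helper `apply_bp` are
hypothetical; the string values use the common K28.x naming for the
28.x entries in the table.

```python
from enum import Enum


class KChar(Enum):
    SYNC_INDICATION = "K28.0"
    LOCKDOWN = "K28.1"
    PACKET_ABORT = "K28.2"
    RESYNC_WINDOW = "K28.3"
    BP_SET = "K28.4"
    IDLE = "K28.5"
    BP_CLR = "K28.6"


def apply_bp(kchar, bp_bit):
    """Return the bp bit after a control character is received."""
    if kchar is KChar.BP_SET:
        return 1
    if kchar is KChar.BP_CLR:
        return 0
    # All other characters leave the bp bit unchanged.
    return bp_bit
```

A receiver built this way treats BP set and BP clr as idle cycles
that also carry one bit of state.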
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The switch has a variable number of data bits supported 
to each backplane channel depending on the striping configuration
for a packet. Within a set of transceivers, data is filled in the
following order:

F[fabric]_[oc192 port number][oc48 port designation
(a,b,c,d)][transceiver_number]

Everything in the documentation is done for fabric-1,
which is the case where all connections are needed. The only part
of this which is used for fill order is transceiver_number (OC48)
and transceiver number and oc48 port designation for OC192.

The fundamental rules for mapping are the following:

1. [illegible] RW are on transceiver 1. These always occupy the
first 4 bits of the transceiver.

2. Data bits starting with the least significant bit are filled
into the data bus in a 2 bit bit-interleaved pattern, with bus A
and bus B pairs.

3. Transceivers are filled in starting at bit 0 of their transmit
and receive interfaces.

4. All multibit routeword fields are transmitted LSB to MSB. This
includes connection number, number of fabrics and encoded values of
stop/align/final segment. The overall routeword is notated as
starting from bit 0 (least significant bit) and up. Transmit order
is Bit 0 (OOP) goes on the first routeword bit, followed by bit 1
(Packet type). If multiple routeword bits are transmitted in the
same clock they are filled in starting with the first bit going to
bit 0, the second bit going to bit 1.

5. Data should be encoded and decoded based on a bus A/Bus B
order.

6. For OC192, the fill order should be bus A, B, [illegible]
routeword bits. For data bits, the fill order depends on
wacking/unwacking/reverse unwacking and reverse wacking functions.

Transceiver 1

For an ingress bus, the format of data is the following:



Brtr 

Bd-t- 




—Bf" 






— e- 


Bit- 




RWA 


B±t- 




RWD 


Bi±- 




■Dataa (0) 


Bit- 


-&- 


Dataa (1) 


BiHr- 

Bit- 


-€- 
-=h 


Datab (0) 
Datab(l) 



2 0 Note that — for — 3r2 — fabric mode^ — bits — D and 7 — are unused. 

The location of datab(O) — does not change . 



For the egress bus, the format of the data is the
following:

Bit 0    RWA(0)
Bit 1    RWA(1)
Bit 2    RWB(0)
Bit 3    RWB(1)
Bit 4    Dataa(0)
Bit 5    Dataa(1)
Bit 6    Datab(0)
Bit 7    Datab(1)



Transceiver 2 and up

Fill up the data bus starting at each transceiver bit 0
to bit 7 with 2 bit interleaved dataa/datab patterns.

For example, transceiver 2 has the following pattern:

Bit 0    Dataa(2)
Bit 1    Dataa(3)
Bit 2    Datab(2)
Bit 3    Datab(3)
Bit 4    Dataa(4)
Bit 5    Dataa(5)
Bit 6    Datab(4)
Bit 7    Datab(5)



The stop/align encoding depends on the width of the bus interface.

TABLE 13: OC48 portcard to fabric routeword stop/align

Field:    Stop/Align

Length:   2 + n (where n is the clock cycles of transfer)

Function: In this mode, this field is stop & align & final_segment.

Stop bit is a 1 to indicate no stop, zero indicates stop. Stop bits
repeat in a serial stream until a stop bit of zero is seen, followed
by the align bit and FS. Since stop is followed by the align and FS
bits, the stop bit is given 2 clock cycles before the end of data.

Align bit is a one to indicate valid data on the last complete byte
on the interface. For odd 12 bit words (assuming zero based
counting), align = 0 indicates bits 0:3 are valid, and bits 4:11 are
invalid. Align = 1 for these words indicates that all 12 bits are
valid. For even words, align should normally be a 1.

Short packets are indicated by signaling a stop on byte 53 of the
transfer. In reality, 54 bytes will be transferred, but the packet
is flagged as a short packet.

Final segment is a one to indicate a final segment of a packet and a
zero to indicate a partial segment of a packet. Only one packet can
be in transit at any one time on this bus. This bit is only valid
for packets. For cells this bit should be a one. Packets which are
not final segments should be terminated only on odd cycles with all
bits utilized.









TABLE 14: OC192 portcard to fabric routeword stop/align

Field:    Stop/Align

Length:   [illegible] number of extra clocks

Function: Due to length restrictions on this bus, the stop/align has
to be treated differently than for OC48 transfers.

The first clock cycle, this field is 3 bits long and is notated as
SAF0. In all future clock cycles the stop field is 4 bits long and
notated SAF1. The definitions of SAF0 and SAF1 are given below.

SAF0(0): Bit zero is a zero to indicate a stop, a one to indicate no
stop.

SAF0(2:1):
"00" indicates full word transfer.
"01" indicates a full word transfer but for a short packet.
"10" indicates a full word transfer but not the final segment.
"11" is reserved.

SAF1(0): Bit zero is a zero to indicate a stop, a one to indicate no
stop on the current cycle.

SAF1(3:1): binary value of the number of valid bytes. Zero is
reserved and 7 is used to indicate 6 bytes valid but not the final
segment. 6 indicates 6 bytes valid and final segment. All partial
word transfers automatically indicate an implied final segment.



TABLE 15: OC48 Fabric-Port card routeword Stop/align

Field:    Stop/Align

Length:   3 + 2 * number of extra clocks

Function: Value is treated as a repeated 2 bit value (encoded stop)
followed by the final segment bit.

Stop field is interpreted as:

11 - continue
00 - 1st byte finished is valid and stop
01 - 2nd byte finished is valid and stop
10 - 3rd byte finished is valid and stop, or non-final segment.

Short packets are indicated by flagging a stop at byte 53.

Final segment is a one for a final segment, a zero for a continuing
packet. For final segments, the stop field should be encoded as a
"1[illegible]".
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I I I 

The port card - fabric interface at OC192 variable routeword bits are given in the table below.

TABLE 16: OC192 Fabric-port card routeword stop/align



1' ICIU 


Length 


Function 


Stop/Align 


7 16'" numbci 


Bft 0 indicates stop. Zero indicates atop, 1 continue. 


transfer 




bitsacdentcaes aaoaiLSUwdO^Btfotes 12b^aijfids[gitit(MDiiTdicdBtlrfaDbiBatipxjalo^ 


Values OxC, Oxr arc reserved. Any non-12 byte ending offset automatically signals end of scgm 


cycle of data. 

Short packets are indicated by flagging a stop at byte 53.









Depending on the switch configuration, the bus may not
transfer an integer number of bytes. This is handled by the
interface always flagging the bytes which finish, and the transmit
and receive state machines must track where bytes begin and end
based on the current cycle in the transfer.

The bus consists of a multiplexed address/data bus
(AD_DATA), a select signal (AD_SEL_L), a read/write signal (AD_RW),
and a bus transaction complete indication signal (AD_RDY_L). AD bus
is used for read/write access of control/status registers.

In order to write to a control/status register, the
read/write signal (AD_RW) must be low. The select signal (AD_SEL_L)
must be asserted low for the entire duration of the access, and
values must be placed on the AD_DATA bus in the following sequence
(cycle 0 is the first cycle where AD_SEL_L is low for this
transaction):

cycle 2-5: Data to be written to control/status register. For
registers that are wider than 8 bits (maximum of 32 bits)
write data must be presented one byte per cycle starting with
LSB. Any data presented on the bus beyond the width of the
register will be ignored.

cycles > 5: ASIC will assert AD_RDY_L on completion of the
write access, and will keep it asserted until AD_SEL_L is
de-asserted.

Figure 12 shows a Write Cycle.

In order to read from a control/status register, the
read/write signal (AD_RW) must be high. The select signal
(AD_SEL_L) must be asserted low for the entire duration of the
access, and values must be placed on the AD_DATA bus in the
following sequence (cycle 0 is the first cycle where AD_SEL_L is
low for this transaction):

cycle 0-1: Address of control/status register

cycle 2: AD_DATA bus should be released (hi z)

cycles > 2: When the data is available, ASIC will drive the
read data onto the bus, one byte per cycle for four cycles,
along with assertion of AD_RDY_L signal. For registers smaller
than 32 bits wide, unused bits are presented as zeros. The LSB
is present on the bus during the 1st clock cycle of AD_RDY_L
assertion.

Figure 13 shows a Read Cycle.
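
The byte ordering on AD_DATA can be sketched as below, assuming the
data is carried least significant byte first, one byte per cycle,
with reads always returning four bytes and unused bits presented as
zeros. The function names are hypothetical and the sketch covers only
the data phase, not the address or handshake signals.

```python
def write_data_bytes(value, width_bits):
    """Bytes presented on AD_DATA for a write, one per cycle, LSB first."""
    if not 1 <= width_bits <= 32:
        raise ValueError("registers are at most 32 bits wide")
    n_bytes = (width_bits + 7) // 8
    # Bits beyond the register width are ignored.
    value &= (1 << width_bits) - 1
    return [(value >> (8 * i)) & 0xFF for i in range(n_bytes)]


def read_value(bus_bytes):
    """Assemble the four bytes driven during a read into a value."""
    if len(bus_bytes) != 4:
        raise ValueError("a read returns four bytes")
    return sum(byte << (8 * i) for i, byte in enumerate(bus_bytes))
```

For example, `write_data_bytes(0x12345678, 32)` yields the bytes
0x78, 0x56, 0x34, 0x12 in that order, and `read_value` reverses the
process for the four bytes of a read.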



The switch chips will generate interrupts on error
conditions. The interrupt lines have the following
characteristics:

1. Level Sensitive

2. Active Low

3. Asynchronous (no clock generated to go along with the
interrupt).

4. Assume point-to-point interconnection with board logic which
combines together interrupts.

Interrupts are maskable on a condition by condition basis
inside each chip. The interrupt signal is asserted on the
occurrence of an error condition and is cleared when the error
condition is cleared. Any temporary conditions which caused an
interrupt are recorded in the chip so no phantom interrupts should
be seen.
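
A minimal model of such an interrupt line follows. It assumes one
pending bit per error condition that stays set until it is explicitly
cleared, and a mask with one bit per condition; the class name, method
names and bit assignments are hypothetical.

```python
class InterruptLine:
    """Level-sensitive, active-low interrupt with per-condition masking."""

    def __init__(self):
        self.pending = 0  # recorded error conditions, one bit each
        self.mask = 0     # a set bit masks the corresponding condition

    def record(self, bit):
        # Even a temporary condition is recorded until it is cleared.
        self.pending |= 1 << bit

    def clear(self, bit):
        self.pending &= ~(1 << bit)

    def level(self):
        """0 (asserted) while any unmasked condition is recorded, else 1."""
        return 0 if self.pending & ~self.mask else 1
```

Because the line is a pure function of the recorded and masked
conditions, board logic that combines several such lines can always
find the cause by reading the chip that holds its line low.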



The reality of the switch is that errors will occur. The
intent in the following is to detail the expected system behavior
and recovery strategy needed for each error type.
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TADLE 17 : — Error r e cov e ry in th e ASICs 



Error: Stuck bit on port card egress
Detection Mechanism: Unstriper sees data corruption from one fabric

Error: Stuck bit between agg & memory controller
Detection Mechanism: Unstriper sees data corruption from one fabric,
either route word or data.

Error: Stuck bit between memory controller & separator
Detection Mechanism: Unstriper sees data corruption from one fabric,
either route word or data

Error: Stuck bit on fabric egress

Error: Soft-fail on routeword from port card
Detection Mechanism: At least two unstripers see either a routeword
mismatch, a state with a high number of routeword mismatches, or data
parity errors, or any number of unstripers will see a routeword
mismatch, a high number of routeword mismatches or data parity errors
and an aggregator will see a synch error
Error recovery required: Queue resynch
Hardware comments: Worst case scenario involves failing routeword
with different fabric routewords to fabrics. Either queueing a packet
to the wrong port or dropping the traffic in the aggregator can cause
an impact to all ports. Probability of impacting more ports goes up
with traffic load and memory utilization in memory controllers.

Error: Soft-fail on data from port card
Detection Mechanism: Unstriper sees one time error
Error recovery required: None
Hardware comments: Probability of automatic hardware based data
recovery is high

Error: Soft-fail between agg/memory controller dest id bus
Detection Mechanism: At least two unstripers see either a routeword
mismatch, a state with a high number of routeword mismatches, or data
parity errors
Error recovery required: Queue resynch

Error: Soft-fail between agg/memory controller data bus
Detection Mechanism: Unstriper sees one time error
Error recovery required: None
Hardware comments: Probability of automatic hardware based data
recovery is high

Error: Soft-fail between memory controller/separator channel ID bus
Detection Mechanism: At least two unstripers see either a routeword
mismatch, a state with a high number of routeword mismatches, or data
parity errors
Error recovery required: Queue resynch
Hardware comments: Tokens get out of synch. May see error of FIFO
overflow in the separator, depending on traffic pattern. Need
congestion on the fabric for a port to have the FIFO overflow become
possible. May also see excess tokens in memory controller.

Error: Soft-fail between memory controller/separator data bus for
raw data
Detection Mechanism: Packet boundaries from one separator port are
lost. Unstriper will show a large number of errors for all traffic
from the affected aggregator output.
Error recovery required: Queue Resynch
Hardware comments: Inherent that no self-stabilizing occurs w/o queue
resynch.

Error: Soft-fail between memory controller/separator data bus for
packet data
Detection Mechanism: Single port sees one-time error.
Error recovery required: None

Error: Soft-fail on token bus from separator to memory controller
Detection Mechanism: Mismatches from fabric due to differences in
separator scheduling.
Error recovery required: Queue Resynch














Error: Soft-fail internal to fabric chips
Detection Mechanism: Unstriper sees different traffic from fabric
than other fabrics
Error recovery required: Reset
Hardware comments: Queue Resynch may fix the problem, reset is
necessary for restoring state.

Error: Aggregator never sees back plane idle to synchronize to rx bus
Detection Mechanism: Aggregator never sets flag indicating it has
seen back plane sync
Error recovery required: Replace faulty hardware.
Hardware comments: Same as below

Error: Aggregator never sees system synch
Detection Mechanism: Aggregator never sets flag indicating it has
seen back plane sync
Error recovery required: Replace faulty hardware
Hardware comments: Locating fault requires seeing if only this board
is having problems (backplane sync receiver) or if multiple boards
sync signals on the back plane). Error isolation in 40G switch
requires looking at the state of the secondary synch pulse

Error: Memory controller does not see synch from agg
Error recovery required: Replace faulty hardware.

Error: Separator does not see synch from mem cont
Detection Mechanism: Separator never gets initial synch
Error recovery required: Replace faulty hardware

Error: Unstriper does not see back
Detection Mechanism: Unstriper never gets back plane
Error recovery required: Replace faulty hardware

Error: Fabric chips not initialized
Detection Mechanism: Chips do not do anything
Error recovery required: Initialize the hardware
Hardware comments: Fault can be caused by failure of the on-board
processor. If soft-fail, watchdog should catch it.

Error: Striper not initialized
Detection Mechanism: Transmit no data on the back plane
Error recovery required: Initialize striper

Error: Unstriper not initialized
Error recovery required: Initialize unstriper

Detection Mechanism: striper, interrupt asserted
Hardware comments: Detection comes up as a result of a disagreement
between the stripe amount and the configuration register for the
switch operating mode.

Error: Primary sync pulse TX failure
Detection Mechanism: Synch pulse receiver on all boards will see
error on primary and switch to secondary.

Error: Secondary sync pulse TX failure
Detection Mechanism: Synch pulse receiver on all boards will see
error on secondary.

Error: Sync pulse receiver failure on one board
Detection Mechanism: If leaving reset, no chips on board get in sync.
If during operation, should see a synch error either in an aggregator
or an unstriper fed by this block.
Error recovery required: Replace board with bad synch pulse receiver
Hardware comments: Need to see how wide error is spread to attempt to
identify the source.

Error: Board loses single sync pulse internal to the board
Detection Mechanism: None
Error recovery required: If any FIFOs overflow in aggregator or
unstriper, queue

Error: Hard failure on sync pulse distribution to a single chip on a
fabric
Detection Mechanism: May see FIFO overflow/underflow in fabric chip
or see synch failure from the downstream chip. Additionally, if data
is corrupted, the unstriper will report data corruption from the
associated fabric.
Error recovery required: Replace

Error: Hard failure on sync pulse distribution to a single chip on a
port card
Detection Mechanism: Unstriper: May see what looks like a single
fabric mismatch due to one fabric going out of synch before the
others.
Error recovery required: Reset port card
Hardware comments: Same as below.

Error: Soft failure on sync pulse distribution to a single chip on a
port card
Detection Mechanism: None
Error recovery required: If no FIFO overflow, none. If FIFO overflow,
need to reset board(s) with FIFO overflow.
Hardware comments: Striper missing synch pulse could overflow a FIFO
on every fabric. Recovery would need to be done serially and switch
could be effectively down by this error. Only way to ensure all
fabrics do the same thing is to ensure that data path has the same
delay as the synch path since the writes occur at different logical
times. An unstriper missing would affect the output port mapped to
the striper and would require a port card reset to recover.

Error: Soft failure on sync pulse distribution to multiple chips on a
fabric
Detection Mechanism: Unknown
Error recovery required: Reset the fabric

Error: Soft failure on sync pulse distribution to multiple chips on a
port card
Detection Mechanism: Same as single-failure case
Error recovery required: Same as single-failure
Hardware comments: Same as single-failure.

The chipset implements certain functions which are
described here. Most of the functions mentioned here have support
in multiple ASICs, so documenting them on an ASIC by ASIC basis
does not give a clear understanding of the full scope of the
functions required.

The switch chipset is architected to work with packets up
to 64K + 6 bytes long. On the ingress side of the switch, there
are busses which are shared between multiple ports. Most
packets are transmitted without any break from the start of
packet to end of packet. However, this approach can lead to large
delay variations for delay sensitive traffic. To allow delay
sensitive traffic and long traffic to coexist on the same switch
fabric, the concept of long packets is introduced. Basically long
packets allow chunks of data to be sent to the queueing location,
built up at the queueing location on a source basis and then added
into the queue all at once when the end of the long packet is
transferred. The definition of a long packet is based on the
number of bits on each fabric. The following table gives the size
of long packets for different switch sizes.



TABLE 18: Long Packet sizes

Switch Size    Packet Size (bytes)



If the switch is running in an environment where Ethernet
MTU is maintained throughout the network, long packets will not be
seen in a switch greater than 40G in size.



A wide cache-line shared memory technique is used to
store cells/packets in the port/priority queues. The shared memory
is 8K entries x 200-bit wide running at 125MHz. Each memory
controller ASIC yields 25Gbps memory bandwidth. The aggregator
(control port) generates at most 4 streams of OC-48 traffic. The
enqueue and dequeue speed for different switch configurations is
shown in the following table. Note that a 2x speedup can be
achieved for all switch configurations except the 400G switch. Up to
234,057 cells can be stored in the 400G switch. The shared memory
stores cells/packets continuously so that there is virtually no
fragmentation and bandwidth waste in the shared memory.
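The figures in this paragraph can be cross-checked with a few lines of Python. This is only a sketch: the 1,638,400-bit memory size is taken from the caption of the shared memory table, the 7-bit cell length for the 400G switch is taken from the later worst-case analysis, and the function names are illustrative.

```python
ENTRY_WIDTH_BITS = 200         # width of one shared memory entry
CLOCK_HZ = 125_000_000         # shared memory clock, 125MHz
MEMORY_BITS = 1_638_400        # shared memory size per memory controller


def memory_bandwidth_gbps():
    """Raw bandwidth of one 200-bit access per clock cycle."""
    return ENTRY_WIDTH_BITS * CLOCK_HZ / 1e9


def memory_entries():
    """Number of 200-bit entries in the shared memory."""
    return MEMORY_BITS // ENTRY_WIDTH_BITS


def max_cells(cell_bits):
    """Cells that fit when stored back to back without fragmentation."""
    return MEMORY_BITS // cell_bits
```

With these numbers the bandwidth comes out at 25Gbps, the memory holds 8192 entries, and a 7-bit cell gives the 234,057 cells quoted for the 400G switch.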



For the short packets/cells, memory utilization can be
close to 100%. For the long packets, the memory block before the
start of a long packet can be almost completely wasted. The
minimum length for a long packet is 9 cache lines, giving an
effective utilization of memory close to 90% since 1 out of
10 memory cache lines can be wasted.



TABLE 19: Shared Memory (1,638,400 bits) in Each Memory Controller

Switches  Enqueue   Dequeue    Speedup  Cell Length  Number of
          Speed     Speed                            Cells
40G       4.3Gbps   20.7Gbps            39+1 bits
80G       4.7Gbps   20.3Gbps            21+1 bits
120G      5.0Gbps   20Gbps              15+1 bits    102,400
160G      5.3Gbps   19.7Gbps            12+1 bits    126,030
240G      7Gbps     18Gbps              9+1 bits     163,840
400G      9.4Gbps   15.6Gbps            6+1 bits     234,057



There exist multiple queues in the shared
memory. They are per-destination and priority based. All
cells/packets which have the same output priority and blade/channel
ID are stored in the same queue. Cells are always dequeued from
the head of the list and enqueued into the tail of the queue. Each
cell/packet consists of a portion of the egress route word, a
packet length, and variable-length packet data. Cells and packets
are stored continuously, i.e., the memory controller itself does
not recognize the boundaries of cells/packets for the unicast
connections. The packet length is stored for MC packets. There is
a limitation of 4K packets (or cells) in each of the MC queues.

The multicast port mask memory 64Kx16-bit is used to
store the destination port mask for the multicast connections, one
entry (or multiple entries) per multicast VC. The port masks of the
head multicast connections indicated by the multicast DestID FIFOs
are stored internally for the scheduling reference. The port mask
memory is retrieved when the port mask of head connection is
cleaned and a new head connection is provided.

Two configurations of port mask memory are supported:

a. 8K port connections, for a 240G switch

b. 4K connections, for a 400G switch.

Dequeue performance is restricted by several factors: 1)
Padding injected by the aggregator ASICs; 2) Left alignment entries
inserted in the memory controllers; 3) Memory controller output bus
fragmentation caused by the multicast connections; 4) Token bus
latency between the separators and the memory controllers; 5)
Separator output bus padding; and 6) Unstriper output bus
fragmentation. A 400G switch is used as an example to analyze the
worst-case performance since it has most padding, overhead, and
congested traffic.



The aggregator ASICs have to pad a packet (including 36-bit
route word, variable-length packet length field and datagram)
to multiples of 12 since there are 12 memory controllers in one
fabric. The shortest packet each memory controller received is 7-
bit long since a packet can be as short as 84 bit long. The
effective datagram is 3 bits. One entry will be left aligned for
every 16 200-bit memory entries. The left aligned entry can be as
short as 1-bit long. The worst-case datagram dequeue efficiency per
output port of a memory controller is:

(10-bit (dout_mc bus width) x (3/7) (datagram length in a shortest
packet) x (15/16) (left aligned overhead)) x 250MHz (output bus
speed) x 12 (number of memory controllers) / 24 (number of output
ports per separator) = 502Mbps
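The worst-case expression can be evaluated step by step. The sketch below assumes the factors are multiplied together and then divided by 24 output ports per separator; that divisor is inferred so that the product matches the quoted 502Mbps, and the function name is illustrative.

```python
from fractions import Fraction


def worst_case_dequeue_mbps():
    """Worst-case datagram dequeue rate per output port, in Mbps."""
    bus_width_bits = 10                    # dout_mc bus width
    datagram_fraction = Fraction(3, 7)     # 3 datagram bits in a 7-bit packet
    left_align_fraction = Fraction(15, 16) # 1 of every 16 entries is overhead
    bus_clock_mhz = 250                    # output bus speed
    memory_controllers = 12
    ports_per_separator = 24               # assumed divisor, see lead-in
    return (bus_width_bits * datagram_fraction * left_align_fraction
            * bus_clock_mhz * memory_controllers / ports_per_separator)
```

Calling `float(worst_case_dequeue_mbps())` gives a value just above 502, slightly more than the 500Mbps best-case output of a separator channel.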

The best-case output data bus bandwidth per separator
channel is 2 bit x 250MHz, i.e., 500Mbps. In other words, the
worst-case dequeue bandwidth of a memory controller is bigger than
the best case output bandwidth of a separator port. 2x speedup can
be achieved through the twice wide output bus of the separators.
One sync cycle will be fired on the output bus of the separator
every 120 cycles.

The output bus of the unstriper ASIC is 64 bit wide at
100MHz. It can only carry one packet per cycle. In the worst-case,
up to 56 bits are wasted per packet for an OC48 port.

APS stands for Automatic Protection Switching, which is
a SONET redundancy standard. To support the APS feature in the switch,
two output ports on two different port cards send roughly the same
traffic. The memory controllers maintain one set of queues for an
APS port and send duplicate data to both output ports.

To support data duplication in the memory controller
ASIC, each one of [[192]] multiple unicast queues has a
programmable APS bit. If the APS bit is set to one, a packet is
dequeued to both output ports. If the APS bit is set to zero for
a port, the unicast queue operates in the normal mode. If a port
is configured as an APS slave, then it will read from the queues of
the APS master port. For OC48 ports, the APS port is always on the
same OC48 port on the adjacent port card.
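The effect of the APS bit on a dequeue can be sketched in Python. The function and the `aps_partner` mapping are hypothetical names used only for illustration; the real pairing is fixed by the port card layout.

```python
def dequeue_targets(port, aps_bit, aps_partner):
    """Return the output ports that receive a packet dequeued for `port`.

    aps_partner maps a master port to its APS slave port (assumed to be
    the same port position on the adjacent port card).
    """
    if aps_bit:
        # APS enabled: one queue feeds both the master and the slave port.
        return [port, aps_partner[port]]
    # APS disabled: the unicast queue behaves normally.
    return [port]
```

For example, with port 3 paired to port 7, setting the APS bit sends each dequeued packet to both ports, while clearing it sends the packet to port 3 only.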

Port mirroring is similar to the APS except that any port
can pair with any port. Only one pair of port mirroring ports is
supported. A 16-bit port mirror register is used to identify the
master and slave port involved in the port mirror operation. All
ports are compared to the master portion (bit 15:8) of the register
when dequeuing. Port mirror can be disabled. Note that a port can
either have APS enabled or port mirroring enabled, not both. The
value of the port mirror register can be changed on the fly by
shadow registers.

The shared memory queues in the memory controllers among
the fabrics might be out of sync (i.e., same queues among different
memory controller ASICs have different depths) due to clock drifts
or a newly inserted fabric. It is important to bring the fabric
queues to the valid and sync states from any arbitrary states. It
is also desirable not to drop cells for any recovery mechanism.

A resync cell is broadcast to all fabrics (new and
existing) to enter the resync state. Fabrics will attempt to drain
all of the traffic received before the resynch cell before queue
resynch ends, but no traffic received after the resynch cell is
drained until queue resynch ends. A queue resynch ends when one of
two events happens:

1. A timer expires.

2. The amount of new traffic (traffic received after the resynch
cell) exceeds a threshold.

At the end of queue resynch, all memory controllers will
flush any left-over old traffic (traffic received before the queue
resynch cell). The freeing operation is fast enough to guarantee
that all memory controllers can fill all of memory no matter when
the resynch state was entered.
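The two termination conditions and the final flush can be sketched as a small state object. All names, and the use of abstract cycle and traffic counts, are assumptions made for illustration; the actual timer and threshold values are not given here.

```python
from dataclasses import dataclass


@dataclass
class QueueResynch:
    timeout_cycles: int          # assumed timer length
    new_traffic_threshold: int   # assumed limit on post-resynch traffic
    elapsed_cycles: int = 0
    new_traffic: int = 0         # traffic received after the resynch cell
    old_traffic: int = 0         # traffic received before the resynch cell

    def tick(self, drained_old=0, arrived_new=0):
        """Advance one cycle; only old traffic drains during resynch."""
        self.elapsed_cycles += 1
        self.old_traffic = max(0, self.old_traffic - drained_old)
        self.new_traffic += arrived_new
        return self.finished()

    def finished(self):
        """Resynch ends on timer expiry or when new traffic exceeds the limit."""
        return (self.elapsed_cycles >= self.timeout_cycles
                or self.new_traffic > self.new_traffic_threshold)

    def flush(self):
        """Drop whatever old traffic is left once resynch has ended."""
        dropped = self.old_traffic
        self.old_traffic = 0
        return dropped
```

In this model new traffic is only counted, never drained, until `finished()` becomes true, which mirrors the rule that traffic received after the resynch cell waits for the end of queue resynch.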

Queue resynch impacts all 3 fabric ASICs. The
aggregators must ensure that the FIFOs drain identically after a
queue resynch cell. The memory controllers implement the queueing
and dropping. The separators need to handle memory controllers
dropping traffic and resetting the length parsing state machines
when this happens. For details on support of queue resynch in
individual ASICs, refer to the chip ADSs.

Multicast connections are enqueued into one of 4 priority
queues based on the 2 bit priority number. They are stored cache-
line based like the way unicast connections do. Connection numbers
and lengths are stored into one of 4 1K-entry per-priority
connection FIFOs. Multicast packets are subject to be dropped if the
destined connection FIFO is full. In other words, at most 1K
multicast packets can be stored simultaneously for each priority.

The 64Kx16-bit port mask memory will limit the number of
multicast connections supported to 64K, 32K, 16K, 16K, 8K, and 4K
for the 40G, 80G, 120G, 160G, 240G, and 400G switch, respectively.
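One way to see where these limits could come from is to count how many 16-bit entries a port mask needs. The sketch below rests on two assumptions that are not stated in the text: that a mask covers only the regular ports (16, 32, 48, 64, 96 and 192 for the six configurations), and that the number of entries per connection is rounded up to a power of two. Under those assumptions it reproduces the quoted limits.

```python
PORT_MASK_ENTRIES = 64 * 1024   # 64K entries
ENTRY_BITS = 16                 # each entry is 16 bits wide

# Assumed number of regular ports per switch configuration.
PORTS = {"40G": 16, "80G": 32, "120G": 48,
         "160G": 64, "240G": 96, "400G": 192}


def entries_per_connection(ports):
    """Entries needed for one port mask, rounded up to a power of two."""
    needed = -(-ports // ENTRY_BITS)   # ceiling division
    size = 1
    while size < needed:
        size *= 2
    return size


def max_connections(ports):
    """Multicast connections that fit in the port mask memory."""
    return PORT_MASK_ENTRIES // entries_per_connection(ports)
```

For the 400G switch a 192-bit mask needs 12 entries, rounded up to 16, which leaves room for 4K connections.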

For the dequeue side, multicast connections have
independent 32 tokens per port, each worth up to 50-bit data or a
complete packet. The head connection and its port mask of a higher
priority queue is read out from the connection FIFO and the port
mask memory every cycle (125MHz). A complete packet (or 50 bits if
the packet is longer than 50 bits) is isolated from the 200-bit
multicast cache line based on the length field of the head
connection. The head packet is sent to all its destination ports.
The 8 queue drainers transmit the packet to the separators when
non-zero multicast tokens are available for the ports.
The next head connection will be processed only when the current head
packet is sent out to all its ports.

For the worst case analysis, use the 400G switch as an
example where the shortest packet is 7 bit long. Every 8ns cycle
only one connection can be handled (bottlenecked by the connection
FIFO and port mask memory). If the multicast only goes to 1 port,
the effective dequeue throughput for the multicast connection is
875Mbps out of available 15Gbps shared memory dequeue bandwidth,
i.e., 6%. In other words, the multicast performance is severely
damaged by the bottlenecks existing in the connection FIFO, port
mask memory, and head-of-line blocking. The throughput for the 400G
switch is 400'*'7+n/00-n'*'42G where n is number of copies a multicast
connection destined. In the worst case where n=1, the multicast
throughput is about 9% of available switch capacity. If the average
multicast connections make 11 copies, the switch can achieve 400G
throughput.
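The single-destination figure follows from one 7-bit packet being handled per 8ns cycle. The sketch below redoes that arithmetic; the 15Gbps dequeue bandwidth is the rounded value used in the paragraph, and the function name is illustrative.

```python
def multicast_worst_case(packet_bits=7, cycle_ns=8, dequeue_gbps=15.0):
    """Return (throughput in Mbps, share of dequeue bandwidth)."""
    mbps = packet_bits / cycle_ns * 1000      # bits per ns, scaled to Mbps
    share = mbps / (dequeue_gbps * 1000)      # fraction of dequeue bandwidth
    return mbps, share
```

This gives 875Mbps, a little under 6% of the available dequeue bandwidth.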

The longer a packet is (for the 240G switch or smaller
configurations), and the more ports a multicast connection is destined
to, the better the dequeue performance becomes. Multicast
performance does not interfere with the dequeue speedup for unicast
connections since the latter have their own tokens and the two types of
connections share the dout_mc bus alternately in a strict round-
robin fashion, i.e., the multicast connections do not block unicast
ones.

There are 192 unicast queues, 4 multicast queues, and 4
control port queues. 4 multicast queues are per priority based and
can broadcast to any subset of 192 output ports and the 4 control
ports.

There are up to 196 destination channels (192 blade
channels and 4 control ports) for the 400G switch. Each destination
has a one-to-one mapped unicast queue. 4 multicast queues can
broadcast to any subsets of 192 regular ports indicated by the per-
connection based port mask entry. An OC-192 port uses one out of
4 queue locations. Other three queues are unused. All 8-bit fabric
queue ID field on the DestID bus is used to identify one of 196
ports. 2-bit priority field is unused.

For the 240G switch, up to 100 destination channels exist
(96 blade channels and 4 control ports). 96 unicast destination
queues have 2 priority queues each. 4 multicast queues can
broadcast to any subsets of 96 ports indicated by the per-con-
nection based port mask entry. An OC-192 port uses one out of 4
queue locations. Other three queues are unused. Lower 7-bit queue
ID is used to identify one of 100 ports and lower 1-bit of priority
field is used to identify one of two priority queues in each port.
Other queue ID bit and priority bit is unused.

For the 160G switch, up to 68 destination channels exist
(64 blade channels and 4 control ports). 64 unicast destination
queues have 2 priority queues each. There are 64 unused queues. 4
multicast queues can broadcast to any subsets of 68 ports indicated
by the per-connection based port mask entry. An OC-192 port uses
one out of 4 queue locations. Other three queues are unused.
Lower 7-bit queue ID is used to identify one of 100 ports and lower
1-bit of priority field is used to identify one of two priority
queues in each port. Other queue ID bit and priority bit is unused.

For the 120G or smaller switch, up to 52 destination
channels exist (48 blade channels and 4 control ports). 48 unicast
destination queues have 4 priority queues each. 4 multicast queues
can broadcast to any subsets of 48 ports indicated by the per-
connection based port mask entry. An OC-192 port uses one out of
4 queue locations. Other three queues are unused. Lower 6-bit
queue ID is used to identify one of 52 ports and 2-bit priority
field is used to identify one of 4 priority queues in each port.
Other queue ID bits are unused.
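The way the queue ID and priority fields shrink or grow with the switch size can be summarized in code. The sketch assumes an 8-bit queue ID and a 2-bit priority field on the DestID bus, with the used bits taken from the low end of each field; the table and function names are illustrative.

```python
# (queue-ID bits used, priority bits used) for each configuration.
FIELD_WIDTHS = {
    "400G": (8, 0),   # all 8 ID bits, priority field unused
    "240G": (7, 1),   # lower 7 ID bits, lower 1 priority bit
    "160G": (7, 1),
    "120G": (6, 2),   # lower 6 ID bits, both priority bits
}


def decode_dest(config, queue_id, priority):
    """Mask the DestID fields down to the bits used by `config`."""
    id_bits, prio_bits = FIELD_WIDTHS[config]
    port = queue_id & ((1 << id_bits) - 1)
    prio = priority & ((1 << prio_bits) - 1)
    return port, prio
```

For instance, in the 240G configuration only the lower 7 queue ID bits and the lowest priority bit are kept, so the remaining bits are ignored.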

Queue structure can be changed on the fly through the fabric
resync cell where the number of priority per port field is used to
indicate how many priority queues each port has.



The striper ASIC resides on the network blade. It has the
following features:

Support packet/cell interfaces. Can accept up to 3 Gb/sec of
sustained traffic (3.2 Gb/sec in bursts) of cells, frames, or
a mix of cell and frame traffic.

Generates fabric routeword for all fabrics in the switch

Calculates data for the parity fabric and adds checksum to the
end of each packet.

Support switch configuration: 40G, 80G, 120G, 160G, 240G, and 400G

Generates appropriate signals to interface directly to the
transmit side of the Gbit transceivers.

The Striper takes DID cell/packet format from the ingress
port ASIC. For the ATM interface, the ASX cell format is accepted
from the Vortex ASIC of the Poseidon chipset at 2.5Gbps for the
channelized blade. It consists of 4-byte route word, 4-byte ATM
cell header (without HEC byte), and 48-byte payload. 36 bit
switch route word can be generated based on the ASX route word
provided by the Vortex ASIC.

The Striper ASIC consists of three major blocks: the
switch route word generator, the switch payload & checksum
generator, and the switch parity generator.

The switch payload generator forwards 4-byte ATM cell
head, 48-byte ATM cell payload and 2-byte checksum to up to 12
switch fabrics and 1 spare fabric. The cell bus is 2x 12 bit wide
running at 125MHz.






The Striper ASIC duplicates the packet/cell and transmits
various fragments to the fabrics. 12 data output buses of the
striper ASICs are connected to the data input buses of the
aggregator ASICs on the fabrics as follows:

Figure 14 shows striper ASIC architecture.



TABLE 20: Data bus connectivity of the Striper ASIC of blade #1

DIN_AG bus        40G (1 fabric)       80G (2 fabrics)    120G (3 fabrics)   160G (4 fabrics)   240G (6 fabrics)   400G (12 fabrics)
DIN_AG_1_1_ch_1   [11:0]-cell[11:0]    [5:0]-cell[11:6]   [3:0]-cell[11:8]   [2:0]-cell[11:9]   [1:0]-cell[11:10]  [0]-cell[11]
DIN_AG_2_1_ch_1   -                    [5:0]-cell[5:0]    [3:0]-cell[7:4]    [2:0]-cell[8:6]    [1:0]-cell[9:8]    [0]-cell[10]
DIN_AG_3_1_ch_1   -                    -                  [3:0]-cell[3:0]    [2:0]-cell[5:3]    [1:0]-cell[7:6]    [0]-cell[9]
DIN_AG_4_1_ch_1   -                    -                  -                  [2:0]-cell[2:0]    [1:0]-cell[5:4]    [0]-cell[8]
DIN_AG_5_1_ch_1   -                    -                  -                  -                  [1:0]-cell[3:2]    [0]-cell[7]
DIN_AG_6_1_ch_1   -                    -                  -                  -                  [1:0]-cell[1:0]    [0]-cell[6]
DIN_AG_7_1_ch_1   -                    -                  -                  -                  -                  [0]-cell[5]
DIN_AG_8_1_ch_1   -                    -                  -                  -                  -                  [0]-cell[4]
DIN_AG_9_1_ch_1   -                    -                  -                  -                  -                  [0]-cell[3]
DIN_AG_10_1_ch_1  -                    -                  -                  -                  -                  [0]-cell[2]
DIN_AG_11_1_ch_1  -                    -                  -                  -                  -                  [0]-cell[1]
DIN_AG_12_1_ch_1  -                    -                  -                  -                  -                  [0]-cell[0]
DIN_AG_sp_1_ch_1  [11:0]-parity[11:0]  [5:0]-parity[5:0]  [3:0]-parity[3:0]  [2:0]-parity[2:0]  [1:0]-parity[1:0]  [0]-parity[0]



The striper ASICs on blade #1 is connected with
aggregator ASIC #1 of all switch fabrics. The striper ASICs on
blade #2 is connected with aggregator ASIC #2 of all switch
fabrics. The striper ASICs on blade #4 is connected with aggregator
ASIC #4 of all switch fabrics. The striper ASICs on blade #5 to #8
are connected with aggregator ASIC #5 to #8 of all switch fabrics,
respectively. The striper ASICs on blade #41 to #48 are connected
with aggregator ASIC #5 to #8 of all switch fabrics, respectively.
In other words, blade number moduled by 8 is the aggregator ASIC
number which a striper ASIC is connected to.
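The modulo rule can be written down directly. This sketch assumes blades and aggregator ASICs are both numbered from 1, so a remainder of 0 is read as aggregator #8; the function name is illustrative.

```python
def aggregator_for_blade(blade):
    """Aggregator ASIC number (1..8) serving a given blade (1..48)."""
    if not 1 <= blade <= 48:
        raise ValueError("blade numbers run from 1 to 48")
    # Blade number modulo 8, with a remainder of 0 mapped to aggregator 8.
    return (blade - 1) % 8 + 1
```

Under this reading blades 1, 9 and 17 share aggregator #1, and blades 5, 13 and 21 share aggregator #5.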

The parity bits are sent to the spare fabric. The purpose
of the spare fabric is to provide fault tolerance ability to the
switch, i.e., in case one of the switch fabrics failed, the spare
fabric recovers the lost part of the cell. This is achieved through
a parity bit generator on the striper ASIC. For one fabric
configuration, the 12 bit cell payload is duplicated to the spare
fabric; for 2-fabric configuration, 6-bit parity bits are generated
as follows:

parity bit(1:6) = cell bit(1:6) exclusive-OR cell bit(7:12);

For 3-fabric configuration, 4-bit parity bits are
generated as follows:

parity bit(1:4) = cell bit(1:4) exclusive-OR cell bit(5:8)
exclusive-OR cell bit(9:12);
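The parity scheme is a bitwise exclusive-OR across the slices sent to the fabrics, so any single lost slice can be rebuilt from the others plus the parity. The Python sketch below illustrates this for a 12-bit cell; the function names are hypothetical and bits are held as lists of 0/1 values for clarity.

```python
def split_cell(cell_bits, fabrics):
    """Split a 12-bit cell into equal slices, one per fabric."""
    if len(cell_bits) % fabrics != 0:
        raise ValueError("cell width must divide evenly across fabrics")
    width = len(cell_bits) // fabrics
    return [cell_bits[i * width:(i + 1) * width] for i in range(fabrics)]


def parity_slice(slices):
    """Bitwise XOR of all slices: the data sent to the spare fabric."""
    parity = [0] * len(slices[0])
    for s in slices:
        parity = [p ^ b for p, b in zip(parity, s)]
    return parity


def recover_slice(surviving_slices, parity):
    """Rebuild the one missing slice from the survivors and the parity."""
    return parity_slice(surviving_slices + [parity])
```

With a single fabric the XOR of one slice is the slice itself, which matches the statement that the 12-bit payload is simply duplicated to the spare fabric.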

The route word generator regenerates the switch route
word and sends up to 12+1 1-bit 250MHz route word buses for fabric
1, 2, 3, ..., 12 and the spare fabric.

The aggregator ASIC resides on the switch fabric as shown
in the following figure. Each 40G switch fabric has 8 aggregator
ASICs. It aggregates 6x4 separate cell streams and route words into
a single 12G stream from up to 6 blades and 4 channels. All input
signals from the network blades are 250MHz point to point HSTL. It
outputs a single cell stream that is multiplexed with cell payload
and route words to 12 memory controllers. The ASIC has the following
features:

12Gbps data and route word input from up to 6 network blades
and 4 channels

Route word separation and aggregation

Output 12G data and route word to 12 memory controller ASICs

HSTL interface with the memory controller, receiver interface
for the backplane gigabit transceivers.

Figure 15 shows aggregator ASIC architecture.

The aggregator ASIC supports 80G, 120G, 160G, 240G,
and 400G switch configuration without backplane change. The
backplane connectivity (DIN_AG buses) of a pair of aggregator ASICs
is shown as follows:



TABLE 21: DIN_AG bus connectivity of aggregator ASIC #1 and #5 of switch fabric #1

DIN_AG bus       Striper bus      40G (1 fabric)     80G (2 fabrics)   120G (3 fabrics)  160G (4 fabrics)  240G (6 fabrics)   400G (12 fabrics)
DIN_AG_1_1_ch_1  DOUT_ST_1_ch_1   [11:0]-cell[11:0]  [5:0]-cell[11:6]  [3:0]-cell[11:8]  [2:0]-cell[11:9]  [1:0]-cell[11:10]  [0]-cell[11]
DIN_AG_1_5_ch_1  DOUT_ST_5_ch_1   n/a                [5:0]-cell[11:6]  [3:0]-cell[11:8]  [2:0]-cell[11:9]  [1:0]-cell[11:10]  [0]-cell[11]
DIN_AG_1_1_ch_2  DOUT_ST_9_ch_1   n/a                n/a               [3:0]-cell[11:8]  [2:0]-cell[11:9]  [1:0]-cell[11:10]  [0]-cell[11]
DIN_AG_1_5_ch_2  DOUT_ST_13_ch_1  n/a                n/a               n/a               [2:0]-cell[11:9]  [1:0]-cell[11:10]  [0]-cell[11]
DIN_AG_1_1_ch_3  DOUT_ST_17_ch_1  n/a                n/a               n/a               n/a               [1:0]-cell[11:10]  [0]-cell[11]
DIN_AG_1_5_ch_3  DOUT_ST_21_ch_1  n/a                n/a               n/a               n/a               [1:0]-cell[11:10]  [0]-cell[11]
DIN_AG_1_1_ch_4  DOUT_ST_25_ch_1  n/a                n/a               n/a               n/a               n/a                [0]-cell[11]
DIN_AG_1_5_ch_4  DOUT_ST_29_ch_1  n/a                n/a               n/a               n/a               n/a                [0]-cell[11]
DIN_AG_1_1_ch_5  DOUT_ST_33_ch_1  n/a                n/a               n/a               n/a               n/a                [0]-cell[11]
DIN_AG_1_5_ch_5  DOUT_ST_37_ch_1  n/a                n/a               n/a               n/a               n/a                [0]-cell[11]
DIN_AG_1_1_ch_6  DOUT_ST_41_ch_1  n/a                n/a               n/a               n/a               n/a                [0]-cell[11]
DIN_AG_1_5_ch_6  DOUT_ST_45_ch_1  n/a                n/a               n/a               n/a               n/a                [0]-cell[11]


The 2 x 6 DIN_AG buses of aggregator ASIC #1 and #5 pair
of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively. The 2 x 6 DIN_AG buses of aggregator ASIC #2 and #6
pair of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, and 46,
respectively. The 2 x 6 DIN_AG buses of aggregator ASIC #3 and #7
pair of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, and 47,
respectively. The 2 x 6 DIN_AG buses of aggregator ASIC #4 and #8
pair of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, and 48,
respectively.

Likewise, the 2 x 6 DIN_AG buses of aggregator ASIC #1
and #5 pair of switch fabric #2 are connected to the 12 x DOUT_ST
bus #2 of blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively. The 2 x 6 DIN_AG buses of aggregator ASIC #1 and #5
pair of switch fabric #12 are connected to the 12 x DOUT_ST bus #12
of blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively, for the 400G switch configuration.






The above connectivity is repeated 4 times for the
channelized blades.

For the 40G, 80G, 120G, 160G, 240G, and 480G
configurations, each blade channel sends 12 x 36-bit cell payload
and 36-bit route word, 6 x 36-bit payload and 36-bit route word,
4 x 36-bit payload and 36-bit route word, 3 x 36-bit payload and
36-bit route word, 2 x 36-bit payload and 36-bit route word, and
1 x 36-bit payload and 36-bit route word to each switch fabric,
respectively. In other words, the whole 12-bit wide cell is
transmitted in the same fabric for the 40G switch while only a
1-bit wide (1/12 cell) cell slice is transmitted on each fabric for
the 480G switch.
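The relationship between the number of fabrics and the slice count sent to each fabric can be sketched as follows. This is only an illustration: the fabric counts 1, 2, 3, 4, 6, and 12 are inferred from the 12, 6, 4, 3, 2, and 1 slice counts in the paragraph above, the 36-bit slice width is the reading adopted from the text, and the function names are hypothetical.

```python
# Assumption: a cell is 12 slices wide, each slice carrying a 36-bit payload,
# and the slices are divided evenly across the installed fabrics.
CELL_SLICES = 12
SLICE_PAYLOAD_BITS = 36


def slices_per_fabric(num_fabrics: int) -> int:
    """Number of cell slices a blade channel sends to each fabric."""
    if num_fabrics <= 0 or CELL_SLICES % num_fabrics != 0:
        raise ValueError("fabric count must divide the 12 cell slices evenly")
    return CELL_SLICES // num_fabrics


def payload_bits_per_fabric(num_fabrics: int) -> int:
    """Payload bits of one cell carried by each fabric."""
    return slices_per_fabric(num_fabrics) * SLICE_PAYLOAD_BITS
```

With one fabric the whole 12-slice cell travels through the same fabric; with twelve fabrics each fabric carries a single slice.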

The 60-bit DOUT_AG bus is split onto 12 memory controller
ASICs, each receiving 5-bit data and a 1-bit clock signal from one
aggregator ASIC. The 15-bit DestID bus is broadcast to all 12
memory controllers. Due to the fan-out load concern, 3 copies of
the signals are maintained, each driving 4 ASIC loads.

Every channel of the aggregator sends up to a 12x3x200 bit
cell/packet stream to 12 memory controllers based on a
work-conserving round-robin dequeue algorithm, i.e., the next source
takes over if the current source runs out of eligible cells/packets
to send. A strict round-robin algorithm is used among 24 sources. For
the 40G switch, only 4 source channels exist. A source is eligible
to send a cell/packet whenever a full cell or a full short packet
or a 12x3x200 bit segment of a long packet is received.
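A work-conserving round-robin dequeue of the kind described above can be sketched as follows. This is a minimal software illustration, not the ASIC logic: each source is modelled as a queue of already eligible cells, and the function name is hypothetical.

```python
from collections import deque


def round_robin_dequeue(sources, start=0):
    """Yield (source_index, cell) in work-conserving round-robin order.

    `sources` is a list of deques holding eligible cells. An empty source
    is skipped immediately, so the next source takes over without idling.
    The generator stops once a full pass finds every source empty.
    """
    count = len(sources)
    index = start
    idle_visits = 0
    while idle_visits < count:
        queue = sources[index]
        if queue:
            yield index, queue.popleft()
            idle_visits = 0
        else:
            idle_visits += 1
        index = (index + 1) % count
```

For example, with three sources holding two, zero, and one cell, the cells are served in the order source 0, source 2, source 0.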
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Each memory controller ASIC receives 9 independent cell 

streams from 9 aggregator AGICs. There are D IDOMllz DIN_ME_f b_£;e 
buses, each consisting of a 5-bit data bus, a 1-bit clock signal,
and a 15-bit DestID bus. The 60-bit DOUT_AG data buses of all 9
aggregator ASICs are bit-sliced onto 12 memory controllers, each
receiving 5-bit data from one DOUT_AG bus. Every memory controller
gets a separate non-sharing clock signal (named clk1 to clk12) from
each DOUT_AG bus to reduce the load of the clock pin, while 3 memory
controllers share a set of DestID bus from the DOUT_AG bus. The 9
DIN_MC_fb_se buses of memory controller #1 are connected to the
DOUT_AG buses of 9 aggregators as follows:

DIN_MC_fb_1_1_data = DOUT_AG_fb_1_data[48, 36, 24, 12, 0]
DIN_MC_fb_1_1_dest = DOUT_AG_fb_1_dest1
DIN_MC_fb_1_1_clk  = DOUT_AG_fb_1_clk1

DIN_MC_fb_1_2_data = DOUT_AG_fb_2_data[48, 36, 24, 12, 0]
DIN_MC_fb_1_2_dest = DOUT_AG_fb_2_dest1
DIN_MC_fb_1_2_clk  = DOUT_AG_fb_2_clk1

DIN_MC_fb_1_3_data = DOUT_AG_fb_3_data[48, 36, 24, 12, 0]
DIN_MC_fb_1_3_dest = DOUT_AG_fb_3_dest1
DIN_MC_fb_1_3_clk  = DOUT_AG_fb_3_clk1

DIN_MC_fb_1_4_data = DOUT_AG_fb_4_data[48, 36, 24, 12, 0]
DIN_MC_fb_1_4_dest = DOUT_AG_fb_4_dest1
DIN_MC_fb_1_4_clk  = DOUT_AG_fb_4_clk1

DIN_MC_fb_1_5_data = DOUT_AG_fb_5_data[48, 36, 24, 12, 0]
DIN_MC_fb_1_5_dest = DOUT_AG_fb_5_dest1
DIN_MC_fb_1_5_clk  = DOUT_AG_fb_5_clk1

DIN_MC_fb_1_6_data = DOUT_AG_fb_6_data[48, 36, 24, 12, 0]
DIN_MC_fb_1_6_dest = DOUT_AG_fb_6_dest1
DIN_MC_fb_1_6_clk  = DOUT_AG_fb_6_clk1

DIN_MC_fb_1_7_data = DOUT_AG_fb_7_data[48, 36, 24, 12, 0]
DIN_MC_fb_1_7_dest = DOUT_AG_fb_7_dest1
DIN_MC_fb_1_7_clk  = DOUT_AG_fb_7_clk1

DIN_MC_fb_1_8_data = DOUT_AG_fb_8_data[48, 36, 24, 12, 0]
DIN_MC_fb_1_8_dest = DOUT_AG_fb_8_dest1
DIN_MC_fb_1_8_clk  = DOUT_AG_fb_8_clk1

DIN_MC_fb_1_9_data = DOUT_AG_fb_9_data[48, 36, 24, 12, 0]
DIN_MC_fb_1_9_dest = DOUT_AG_fb_9_dest1
DIN_MC_fb_1_9_clk  = DOUT_AG_fb_9_clk1



The DIN_MC data buses of memory controller #2 are
connected to bits 49, 37, 25, 13, and 1 of the DOUT_AG data buses of
9 aggregators, and so on. The DIN_MC data buses of memory controller
#12 are connected to bits 59, 47, 35, 23, and 11 of the DOUT_AG data
buses of 9 aggregators.
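The bit-slicing pattern in the connection list above (controller #1 takes bits 48, 36, 24, 12, 0; controller #2 takes 49, 37, 25, 13, 1; controller #12 takes 59, 47, 35, 23, 11) follows a simple stride-of-12 rule. A minimal sketch, with a hypothetical function name, is:

```python
NUM_CONTROLLERS = 12       # memory controller ASICs sharing one DOUT_AG bus
BITS_PER_CONTROLLER = 5    # data bits each controller receives per bus


def dout_ag_bits_for_controller(controller: int) -> list:
    """DOUT_AG data bit indices wired to memory controller `controller` (1..12)."""
    if not 1 <= controller <= NUM_CONTROLLERS:
        raise ValueError("controller must be in the range 1..12")
    offset = controller - 1
    # Highest bit first, matching the order used in the connection list.
    return [offset + NUM_CONTROLLERS * j
            for j in range(BITS_PER_CONTROLLER - 1, -1, -1)]
```

Taken over all 12 controllers, the slices cover each of the 60 data bits exactly once.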



12 memory controller ASICs aggregate cell/packet streams
from 8+1 aggregator ASICs, then write the cells into one of 200
output queues (e.g., 12 network blades x 4 channelized Poseidon
interfaces x 4 priorities for unicast + 4 priorities for multicast
+ 4 control port queues). The 8-bit destination queue number on
the DestID bus is used as the output queue indicator for the
unicast connection. The multicast cell is stored into one of 4
priority queues based on the 2-bit priority on the DestID bus. The
16-bit multicast connection number on the DestID bus will be used
to look up the internal port mask memory to find out the destination
blade and channels during the dequeue phase.

The memory controllers send out cell/packet traffic from
200 output queues to 8+1 separator ASICs. Dequeuing speed is
twice as fast as enqueuing speed to reduce the amount of cells
buffered on the switch fabric.

• Support both variable-length packet switching and fixed-length
  cell switching
• 12 ASICs are bit-sliced and function as an integrated shared
  memory controller
• Support 40G, 80G, 120G, 160G, 240G, and 480G switch
  configurations
• Enqueue cells/packets from 9 aggregator ASICs
• 2x dequeue speedup to 9 separator ASICs
• On-chip APS support
• 234,057 cells on-chip buffer
• 200 programmable destination queues
• On-chip control port support
• 64K multicast connections, 2^32 unicast connections
• Per-queue transmit and loss counts

Figure 16 shows the memory controller ASIC architecture.

An 8Kx13-bit link list is used to maintain the free/used
memory entry list pointer. A free entry is requested from the free
link list when writing data into the shared memory and the current
tail cache line runs out of space. The complete cell/packet will be
dropped whenever the free list is empty, i.e., the shared memory is
full. A memory entry is freed to the free list after the memory
word is transmitted to the separator ASICs.

Figure 17 shows the wide cache line shared memory
architecture.

The DIN_MC_fb_se_9 and DOUT_MC_fb_se_9 buses are used to
connect to aggregator #9 and separator #9, which communicate with
the control port striper and unstriper ASICs only. They have the same
DestID and cell format as the other 8 buses do. Their cells are
enqueued and dequeued in the same way as the regular cells.

There are up to 4 additional control port queues. They
have queue IDs from 192 to 195. All unicast connections having the
control port queue ID as their fabric queue ID are enqueued into the
relative control port queue. There are at most 4 OC-12 control
ports supported.

Each control port queue has a 13-bit control port
register as follows:



TABLE 22: 13-bit control port queue register

Bit 12:5    8-bit regular port ID
Bit 4       Regular Port enable
Bit 3       Control Port 3 enable
Bit 2       Control Port 2 enable
Bit 1       Control Port 1 enable
Bit 0       Control Port 0 enable



A queue can be multicast to up to 4 physical control
ports and one regular queue. When a queue is redirected to the
regular queue, that queue must be disabled for the regular queue
traffic. Packets are queued in the same way as in the regular
queues, i.e., 200-bit cache line based and left aligned every 16 cache
lines. Strict round-robin is applied among the 4 queues when a left
alignment entry is transmitted. A queue is routed to 4 control ports
and one regular port based on the 5-bit control port enable vector.
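Decoding the 13-bit control port queue register can be sketched as below. The layout assumed here is an 8-bit regular port ID in bits 12:5, the regular port enable in bit 4, and the four control port enables in bits 3 to 0 with control port N in bit N; the bit positions of the individual control port enables are an assumption, and the function and field names are hypothetical.

```python
def decode_control_port_register(value: int) -> dict:
    """Split a 13-bit control port queue register into its fields."""
    if not 0 <= value < (1 << 13):
        raise ValueError("register value must fit in 13 bits")
    return {
        # Bits 12:5 hold the 8-bit regular port ID.
        "regular_port_id": (value >> 5) & 0xFF,
        # Bit 4 enables delivery to the regular port.
        "regular_port_enable": bool((value >> 4) & 1),
        # Bits 3..0 enable control ports 3..0 (list index = control port number).
        "control_port_enables": [bool((value >> port) & 1) for port in range(4)],
    }
```

The regular port enable and the four control port enables together form the 5-bit enable vector mentioned above.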

Two dequeue algorithms are applied among the 4 control port
queues:

a) One control port only talks to one cp queue: pure round-robin
   dequeue among the 4 non-empty control port queues which
   have non-zero unicast tokens; one token worth of unicast (up
   to 200 bit) is sent out to the dout_mc bus for a port;

b) One control port talks to multicast cp queues: strict
   priority among the 4 control port queues; queue 192 has the
   highest priority and queue 195 has the lowest; queues are
   switched when the end of the packet is seen.
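The two selection rules above can be sketched as follows. This is an illustration only: list index 0 stands for queue 192 and index 3 for queue 195, the token accounting is reduced to a per-queue counter, the end-of-packet switching in case b) is not modelled, and all names are hypothetical.

```python
from collections import deque


def pick_round_robin(queues, tokens, last_served):
    """Case a): next non-empty queue with non-zero tokens after `last_served`."""
    count = len(queues)
    for step in range(1, count + 1):
        index = (last_served + step) % count
        if queues[index] and tokens[index] > 0:
            return index
    return None  # nothing eligible


def pick_strict_priority(queues):
    """Case b): lowest-numbered non-empty queue wins (index 0 = queue 192)."""
    for index, queue in enumerate(queues):
        if queue:
            return index
    return None
```

In case b) the caller would keep serving the chosen queue until the end of the current packet before calling the selector again.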

OAM cells are identified by the Fabric queue ID field. If
this field of a unicast connection has the value 0xFx(h), then it is
an OAM cell. All OAM cells can be mapped into one of the 192 blade or
4 control port queues set by an 8-bit programmable register (called
the OAM cell destination register).

A resync cell (0xFF) or any other special cell with the fabric
queue ID set to 0xFx is routed to any one of the 196 queues based on
the OAM cell destination register too.

Per-destination minimum and maximum thresholds and counts
can be set up to help memory management. 200x2x14-bit thresholds
(in units of 200-bit entries) and 200 x 13-bit running counters (in
units of 200-bit entries) are provided. Two additional per-destination
transmit and loss counts (32-bit each, in units of
packets) are also maintained. If the running count of a destination
is above the relative threshold, new packets are rejected and the loss
count increments. Whenever dropping, the whole packet is dropped.
Otherwise, the transmit count increments. For multicast
connections, cells can also be rejected because the multicast route
word FIFO is full. 4 additional FIFO full counts are needed. If a
packet is dropped, the whole packet is cleaned from the memory
(including the segments of a long packet). The thresholds and
current counts are in units of 200-bit cache lines.

The minimum threshold (13-bit value plus 1-bit enable
bit) is used to prevent shared memory starvation, i.e., every queue
reserves at least the number of cache lines indicated by the
threshold. The maximum threshold (13-bit value plus 1-bit enable
bit) is used to prevent any single queue from consuming the whole
shared memory. These two thresholds cannot be changed unless there
are no packets in the queues.
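The maximum-threshold admission rule and the per-destination counts can be sketched as follows. This is a simplified model under stated assumptions: only the maximum threshold is modelled (the minimum-threshold reservation is omitted), the transmit count is incremented when a packet is accepted, as the text describes, counters saturate at 32 bits, and the class and method names are hypothetical.

```python
COUNTER_MAX = 0xFFFFFFFF  # 32-bit counters stick at this value


class DestinationQueueStats:
    """Per-destination running count, transmit count, and loss count."""

    def __init__(self, max_threshold, max_enabled=True):
        self.max_threshold = max_threshold  # in cache lines
        self.max_enabled = max_enabled      # models the 1-bit enable
        self.running = 0                    # cache lines currently queued
        self.transmit = 0                   # accepted packets
        self.loss = 0                       # rejected packets

    def admit(self, packet_lines):
        """Accept or reject a whole packet of `packet_lines` cache lines."""
        if self.max_enabled and self.running > self.max_threshold:
            self.loss = min(self.loss + 1, COUNTER_MAX)
            return False  # the whole packet is dropped
        self.running += packet_lines
        self.transmit = min(self.transmit + 1, COUNTER_MAX)
        return True

    def release(self, packet_lines):
        """Return cache lines to the pool after they are dequeued."""
        self.running = max(0, self.running - packet_lines)
```

A packet is rejected only when the running count already exceeds the threshold, so a single packet may push the count past the threshold before later packets are refused.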

All counters are 32 bits wide. They are reset to zero
automatically after reading. Their values stick to 0xFFFFFFFF if
overflowed. It takes 2^32 x 8 ns, on the order of 32 seconds, to
overflow a counter in the worst case.

The value of any threshold register can be updated on the
fly by a resync cell or a shadow control cell. The content of the
32-bit shadow data register is copied to the location pointed to by
the shadow address register.

The memory controller can enqueue a single OC-192 data
stream from the aggregator ASIC and dequeue a single OC-192 data
stream to the separator ASIC instead of 4xOC-48 streams. At the
ingress side, the ASIC receives 4 continuous cells/packets/cache
lines from the same source channel instead of 4 channels. No
special treatment is needed.

At the egress side, the Queue Drainer reads 4 cache lines
from the shared memory for one destination after a token command is
received for the OC-192 port. The RCD can send up to 4 200-bit
cache lines to the separator from the same destination queue. Each
OC-192 port has 4 priorities for all switch configurations.

The separator ASICs receive cell/packet streams from 12
memory controllers, separate them, and send them to up to 48 network
blades through the backplanes. The interfaces between the separator
and the backplane are 250MHz point-to-point HSTL signals.

Figure 18 shows the Separator ASIC architecture.

• Receive 12 data streams from 12 memory controllers
• Fabric synchronization
• 24-destination (blades and channels) addressing
• Route word separation and aggregation
• 0.25um 3V CMOS technology
• 410 I/O pins
• 140-bit 250MHz input, 240-bit 250MHz output (at most 120 of
  them switch simultaneously); 30-bit control signals

The separator has twice the number of data output pins of
the aggregator ASIC to support 2X speedup. Similar to those
of the striper ASIC, the ASIC supports 40G, 80G, 120G, 160G, 240G,
and 480G switch configurations without a backplane change.

The separator ASIC performs the reverse function of the
aggregator ASIC. The ASIC receives a 120-bit 250MHz cell/packet
stream from one of 8 DOUT_MC_fb_se buses of every memory
controller (12 of them). 10-bit blade and channel selection signals
are used to select one of 24 destinations inside each separator for
up to two cells. For example, the DIN_SP buses of separator ASIC #1
are connected as follows:

When a valid cell/packet (channel ID is in the valid
range) is received, the packet type field in the route word is
checked first. If it is an ATM cell, no packet length field
follows. The length of the cell payload is 36x12/number of fabrics. If
it is a packet, the packet length bit that immediately follows is used
to indicate how long the packet length field is: 0 for a 12-bit packet
length (including this bit) and 1 for a 24-bit packet length (including
this bit). The entire packet/cell is routed to the destination channel
indicated by the channel ID. An invalid channel ID (bigger than
the valid range) is used to indicate that the cell/packet is invalid.

The ASIC then separates the route word and the payload
onto the route word bus and the data bus of one of 6 blades and 4
destination channels/unstriper ASICs based on the channel ID
signals. One 250MHz 24-bit data bus yields 6Gbps data bandwidth for
each channel. Each route word is 2 bits wide, running at 250MHz.

The connectivity between the separator ASICs and the
unstriper ASICs is symmetric to that between the aggregator ASICs
and the striper ASICs. The only difference is that all data and
route word pins have double width to achieve 2X speedup.

Data received from each destination of each memory
controller is accompanied by a 1-bit valid bit. There are 24
destination input FIFOs used to store the 12 pieces of
cells/packets from 12 memory controllers for 24 destination blades
and channels in each separator, respectively. When all 12 cell
segments arrive, the complete cell is sent to the relative output
FIFO indicated by the channel ID.






Like the striper ASIC, a 3-bit sequence number counter is
maintained for the backplane synchronization. It increments every
36 250MHz cycles. When a cell is sent to the unstriper ASICs via
the backplane, the current counter is attached into the sequence
number field in the 36-bit route word.

The sequence number counter is reset by the global
resynchronization logic.

The unstriper ASIC takes 6Gbps traffic from up to 12+1
switch fabrics. It then unstripes the cell and sends it to the
egress netmod ASIC at 5Gbps or lower speed.

• Receive 6Gbps route word and data from up to 12+1 fabrics at
  250MHz for OC48, or combine 4 chips to support 24 Gbps
  routeword and data from up to 12+1 fabrics for OC192c
• Error check data transport throughout the switch, detect
  corrupted data and perform data recovery
• Reconstructs cells/packets from the individual switch fabrics
• Send 64-bit 100MHz data to the egress port ASIC for OC48, 256
  bit for OC192c
• Supports both UC and MC connection context for fabric data

Figure 19 shows the unstriper ASIC architecture.

The unstriper ASIC receives cells from up to 12+1
fabrics, each running at 250MHz. It uses the following steps to
reconstruct the data.

1. All incoming routewords are compared. If any one routeword
disagrees, that data lane is flagged as being in error. If more
than one routeword disagrees, the data is dropped.

2. All valid input lanes are put through reconstruction logic which
will attempt to build N+1 candidate output data streams for an N
fabric switch. Any data lane which is not valid will invalidate any
data lane which uses that data.

3. All valid reconstruction lanes will check the [illegible] of the
received data and one passing output is selected.
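The routeword comparison in the first step can be sketched as follows. This is an illustration under an assumption the text does not state: a routeword "disagrees" when it differs from the most common value among the lanes. The function name is hypothetical.

```python
from collections import Counter


def check_routewords(routewords):
    """Compare per-fabric routewords.

    Returns (lane_ok, drop): `lane_ok[i]` is False when lane i disagrees
    with the most common routeword, and `drop` is True when more than one
    lane disagrees, in which case the data is discarded.
    """
    majority_value, _ = Counter(routewords).most_common(1)[0]
    lane_ok = [word == majority_value for word in routewords]
    disagreeing = lane_ok.count(False)
    return lane_ok, disagreeing > 1
```

A single disagreeing lane is only flagged as being in error, which leaves the reconstruction logic of the second step free to rebuild the data without it.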

The striper remaps the separate routeword and data buses
to a combined outgoing routeword+data bus.

The following will detail the steps which happen at power
up from an architectural perspective. Note that when expanding
switch capacity, the additional fabrics must be brought on-line
before any new port cards are brought on-line.

Fabric Initialization

1. Port cards (unstripers) are initialized to only look at the
   current fabric capacity and ignore other fabric inputs.

2. The fabric is inserted and asserts its board present signal.
   Stripers start sending routewords to the new fabrics, though they
   are ignored at this point.

3. The board is reset and the MCP starts to boot the board. Before
   proceeding to the next step, the MCP/SCP establish
   communication via the e-net network.

4. If the board is fabric 0 or the parity fabric, the sync pulse
   transmitter is initialized. (Actually the sync pulse transmitter
   can be initialized on all fabrics, but it is only connected to
   DP signals if it is fabric 0 or the parity fabric.)

5. The sync registers in the aggregator, memory controller, and
   separator are initialized, then the registers in the sync pulse
   receiver. The sync pulse receiver starts to look for a valid
   sync pulse. The last sync setup is the sync pulse receiver, so
   that all receivers on the chips are ready for the sync pulse
   from the sync pulse receiver. The fabric chips run chip-chip
   sync on the next backplane sync pulse. The MP should check to
   make sure the fabric has synchronized. If sync has not been
   achieved, reset the fabric chips and re-execute step 4.

6. The SCP tells the MP the current switch capacity window to use.
   This is actually going to correspond to the current switch
   capacity (it does not count the capacity of the new fabric if
   switch capacity is being expanded).

7. The MP initializes the backplane transceiver networks with the
   current switch capacity (both send and receive) and initializes
   all registers except the aggregator input enables. All values
   used for configurable options (which ports are OC48/OC192,
   memory thresholds, etc.) need to be communicated and
   initialized at this point. Certain registers are initialized
   based on the switch board slot, which needs to be known at this
   point. From a software perspective, the biggest register set
   which must be done is to update the port mask table in the
   memory controllers to match the port mask table from another
   switch fabric.

8. Aggregator input enables are set for the current switch
   capacity. This will start enqueueing traffic on this switch
   board. The aggregators will need to see a bus idle followed by
   an increment in the transmit sequence number before starting
   to actually receive data.

9. The SCP sends a queue resync cell. On cell return, the fabric
   queues are now synchronized. However, no valid data is being
   enqueued in the new fabric(s) and the fabric outputs are being
   ignored.

10. All unstripers must be configured to start utilizing the new
    fabric. Since queues have been resynchronized, the fabric
    dequeuing should be synchronized and no errors should be seen.
    If errors are seen, clear them and return to step 9.

11. After all unstripers have been updated, the SCP tells all port
    card MCPs to update the stripe amount inside each of the striper
    ASICs. The change in striper configuration will start the
    switch utilizing the additional capacity.

12. After all stripe amounts are updated and traffic from the
    previous stripe amount has drained from the switch, the
    switch capacity needs to be updated. The only fixed time bound
    way to ensure traffic from the previous stripe amount is
    flushed is to execute a queue resync. If not all traffic has
    been flushed from the system with the previous stripe amount,
    the switch will drop this traffic at the unstripers (since
    there is no synchronization of the update at the separators,
    the drop cannot be performed there).

Before a port card is brought on-line, any necessary
switch fabrics must be brought on-line first. As per the switch
standard convention, port card installation happens in order.

1a. The starting state has sufficient switch capacity to support
the new port card. Aggregators are currently configured to ignore
the input from any new board.

1b. The port card is inserted and asserts its board present signal.
The port card sees the sync pattern received from the fabrics.

2. The sync pulse receiver is initialized. The port card starts
looking for a valid sync pulse on the backplane.

3. The striper transmitter is set up for the appropriate number of
destination fabrics and the Gbit network control is initialized.
Before the Gbit networks are initialized, the fabrics cannot count
on seeing idle data from the new port card. At this point, the
port card can communicate its type (OC48/OC192) to the fabrics.

5a. Fabrics configure the port card type and enable the input from
the port card.

5b. The striper/unstriper are now initialized, along with the other
chips on the board. Some enable in the inbound data path should be
disabled. The BIB input enable in the striper can be used, or some
other board specific input enable.

6. After both 5a and 5b have been completed, the port card can
enable its input side and start sending data to the fabrics. Note
that in general, further software configuration will need to be
done after this point (such as setting up inbound lookup entries).
The completion of 5a is necessary to ensure the fabric queues do
not go out of sync.

7. First data from the port card is striped to all fabrics.

8. When a port card is removed from the system, not very much needs
to happen from a hardware perspective. Before the port card goes
away, it transmits a packet abort which will cause any incomplete
packets in the egress side to be dropped. Traffic will be drained
from the memory queues which correspond to the affected output
ports.

9. To remove a port card from the switch logically, software
should disable the striper output bus.

Fabric deactivation is similar to fabric activation in
reverse. The steps include:

1. Switch capacity is being removed. If port cards are present in
the switch which are paired with the fabric capacity which is about
to be removed, those must first be deactivated.

2. Program the remaining stripers in the system to stripe data to
one less stripe amount than the current configuration. This will
stop sending real data to the fabric about to be decommissioned.

3. Send a queue resynch. This will flush out any traffic at the
last stripe amount.

4. Program the unstripers to start ignoring the data from the
fabric which is about to be removed.

5. The fabric can now be physically removed from the system, or
logically removed from the system by disabling its inputs and
outputs.

The reason for the queue resynch step is not that the
switch is out of sync. The unstriper will treat the receipt of
traffic which is striped to more fabrics than are physically present
in the switch as an error and increment error counts. The queue
resynch ensures that the error counts on the unstripers will not
increment unnecessarily.

1. Flush out traffic from the port to be converted over to APS.
Initialize anything in the separator as required for the new output
port combination.

2. Write to the APS enable bit using the shadow register in every
memory controller for the output port being affected. The main port
for APS is not affected. Either a higher or lower number port can
be the primary port and the backup port. APS is always enabled on
the backup.

3. Send either a queue resync cell or a shadow control cell to all
memory controllers.

4. Memory controllers start to dequeue after the next left-aligned
cache boundary (if the previous transfer for this port was
left-aligned, it will be remembered).

Note that in all this process, the queue number was never switched.
The switch will not support a seamless port swap due to
activate/deactivate. (In other words, APS can be turned on port 0,
which will cause port 0 to mirror port 16. However, APS cannot be
turned off on port 16 since it is not on. Traffic is only being
changed for the port where APS is added.)



The following words have reasonably specific meanings in
the vocabulary of the switch. Many are mentioned elsewhere, but
this is an attempt to bring them together in one place with
definitions.



TABLE 23

Word: Meaning

APS: Automatic Protection Switching. A SONET/SDH standard for implementing redundancy on physical links. For the switch, APS is also used to recover from any detected port card failures.

Backplane synch: A generic term referring either to the general process the switch boards use to account for varying transport delays between boards and clock drift, or to the logic which implements the TX/RX functionality required for the switch ASICs to account for varying transport delays and clock drifts.

BIB: The switch input bus. The bus which is used to pass data to the striper(s). See also BOB.

Blade: Another term used for a port card. References to blades should have been eliminated from this document, but some may persist.

BOB: The switch output bus. The output bus from the striper which connects to the egress memory controller. See also BIB.

Egress Routeword: This is the routeword which is supplied to the chip after the unstriper. From an internal chipset perspective, the egress routeword is treated as data. See also fabric routeword.

Fabric Routeword: Routeword used by the fabric to determine the output queue. This routeword is not passed outside the unstriper. A significant portion of this routeword is blown away in the fabrics.

Freeze: Having logic maintain its values during lock-down cycles.

Lock-down: Period of time where the fabric effectively stops performing any work to compensate for clock drift. If the backplane synchronization logic determines that a fabric is 8 clock cycles fast, the fabric will lock down for 8 clocks.

Queue Resynch: A queue resynch is a series of steps executed to ensure that the logical state of all fabric queues for all ports is identical at one logical point in time. Queue resynch is not tied to backplane resynch (including lock-down) in any fashion, except that a lock-down can occur during a queue resynch.

SIB: Striped input bus. A largely obsolete term used to describe the output bus from the striper and input bus to the aggregator.

SOB: One of two meanings. The first is striped output bus, which is the output bus of the fabric and the input bus of the agg. See also SIB. The second meaning is a generic term used to describe engineers who left Marconi to form/work for a start-up after starting the switch design.

Sync: Depends heavily on context. Related terms are queue resynch, lock-down, freeze, and backplane sync.

Wacking: The implicit bit steering which occurs in the OC192 ingress stage since data is bit interleaved among stripers. This bit steering is reversed by the aggregators.

The Aggregator Receive Synchronizer's function is to
maintain logical cell/packet ordering across all fabrics.
Cells/packets arriving at more than one fabric from different port
cards need to be processed in the same logical order across all
fabrics. If cell/packet logical ordering is not maintained, then
cells/packets coming out of fabrics will have stripes of a
particular cell/packet that do not match up and cannot be
reassembled by the Unstriper.

Logical cell/packet ordering needs to be maintained
across the following conditions:

• Transport delay variances between one source and multiple
  destinations
• Clock drift across transmitters and receivers
• Insertion and removal of port cards and fabrics
• Port card errors such as no sync, no lock-downs, too fast/too
  slow, routeword parity errors
• Gigabit transceiver errors such as loss-of-lock, data errors
• Non-synchronized updates to the Gigabit network
• OC192c data streams (aggregating 4 channels to make up one
  OC192c stream)
OC192c — data — streams (aggregating — 4 — channels — to — make — ttp — one 

2 0 OC192C stream) 

The switch uses a system of transmit and receive counters. The
counters allow all components in the system to logically align
themselves. The Master Sequence Generator implements these two
counters that will count continuously from "0" to "3" and will
increment every x 125 MHz clock cycles, where x is the counter tick
length as programmed by software. x is currently calculated to be
250 cycles. This is based on analysis done in the Backplane
Synchronization ADS. The relationship between the transmit and
receive counters can be seen in Figure 26. One counter will be used
by the transmit synchronizers in the Striper and Separator ASICs
and the other counter will be used in the receive synchronizers in
the Aggregator and Unstriper ASICs. The receive counter will be a
delayed version of the transmit counter. The amount of delay is
programmed by software in the Sync Pulse Receive Delay register.
This register determines the number of clock cycles that the
receive counter waits before incrementing its own counter relative
to the transmit counter. This register should always be non-zero
since the transmitter will have no delay and the receiver needs to
be delayed with respect to the transmitter. The Sync Pulse Receive
Delay has been estimated to be 150 cycles. The delay is
approximately equal to the worst case transport delay between
transmitter and receiver plus the worst case transport delay
variance of the sync pulse. The delay also takes into account worst
case fast and slow transmitters and receivers.
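
The counter relationship described above can be sketched as follows. This is only an illustrative model, assuming a free-running four-state counter, the 250-cycle tick length and the 150-cycle receive delay quoted in the text; the names tx_count and rx_count are hypothetical and do not come from the specification.

```python
TICK_LENGTH = 250   # cycles per counter tick (x in the text)
RX_DELAY = 150      # Sync Pulse Receive Delay, in cycles
COUNT_STATES = 4    # assumed: the counter runs 0, 1, 2, 3 and wraps


def tx_count(cycle: int) -> int:
    """Transmit sequence count at a given 125 MHz clock cycle."""
    return (cycle // TICK_LENGTH) % COUNT_STATES


def rx_count(cycle: int) -> int:
    """Receive sequence count: the transmit count delayed by RX_DELAY cycles."""
    return ((cycle - RX_DELAY) // TICK_LENGTH) % COUNT_STATES


if __name__ == "__main__":
    for cycle in (0, 150, 250, 400, 1000):
        print(cycle, tx_count(cycle), rx_count(cycle))
```

In this model the receive counter makes each transition exactly 150 cycles after the transmit counter makes the same transition, which is the delay the register is described as programming.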



The Sync Pulse Period is defined as the number of cycles between
sync pulses. It is extended slightly by about 10 cycles in order
for it to appear late in the "0" window of each ASIC's sequence
count. This is done to ensure that every ASIC will appear to be
running too fast even if they are actually running slow relative to
the clock that generated the sync pulse. If this was not done, the
sync pulse could appear in the "3" window and the ASIC would
consider itself to be slow. There would be no way for it to catch
up. Each transmitter and receiver will calculate the difference
between when the sync pulse arrives and when its own counter
transitions from "3" to "0". This difference is the number of
cycles that it is fast and is referred to as the lock-down amount
(z in figure). Once a transmitter determines it should lock down
for z cycles, it will finish sending valid data during its "0"
window and then lock down z cycles. During the lock-down period, no
valid or idle data is sent. Instead, a special lock-down K
character is transmitted which will be recognized by the receiver.
The receiver will not write the lock-down characters into its input
FIFOs. This will ensure that the input FIFOs can't overflow. Since
the sequence counter does not advance for the amount of lock down,
it is effectively resetting itself to the sync pulse. It is
equivalent to having the sync pulse appear at the start of the "0"
count window since the transition to a count of "1" occurs
precisely one tick length after the sync pulse arrives. When the
next sync pulse arrives, if clock frequencies are constant, then
the sync pulse should appear in the "0" count window and the
calculated lock-down amount will be the same as the previous
calculation. This allows the system to always expect the sync pulse
arrival in the "0" count window even if the clocks generating the
sequence counter are too fast or too slow.
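
The lock-down calculation above amounts to a single subtraction. The sketch below is a minimal illustration, assuming both events are timestamped in local clock cycles; the function name and its arguments are hypothetical.

```python
def lock_down_amount(wrap_cycle: int, sync_pulse_cycle: int,
                     tick_length: int = 250) -> int:
    """Return z, the number of cycles a device is running fast.

    wrap_cycle       -- local cycle at which the sequence counter wrapped to "0"
    sync_pulse_cycle -- local cycle at which the sync pulse arrived
    The sync pulse is expected inside the "0" window, so z must lie in
    the range [0, tick_length).
    """
    z = sync_pulse_cycle - wrap_cycle
    if not 0 <= z < tick_length:
        raise ValueError("sync pulse arrived outside the '0' window")
    return z


if __name__ == "__main__":
    # The counter wrapped at local cycle 1000, the pulse arrived at 1010.
    print(lock_down_amount(1000, 1010))
```

A device that computes z then holds its sequence counter for z cycles, which is what realigns the counter to the sync pulse.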



The Receive Synchronizer block will use the sequence counter to
determine when to accept data from input byte sync FIFOs. Once a
sync character is read, pops from the FIFOs will only occur once
the sequence counter transitions from "0" to "1" and immediately
following an arrival of a sync pulse. The read decision is only
made once every sync pulse arrival and only at the "0" to "1"
transition of the receive sequence counter. The sequence counter is
also used during fabric resync in order to communicate a fabric
resync to all channels in all aggregators during a sequence count
transition. Fabric resync cells will be transmitted at the
beginning of a sequence tick window and are prefixed by a special
character indicating a resync cell. The receive synchronizers in
the Aggregators will resynchronize all data going to the memory
controllers on the next sequence count transition once the resync
character has been received.

A block diagram of the receive Synchronizer can be seen in the
corresponding figure. The Receive Synchronizer consists of 24
Byte-sync FIFOs, a Crossbar and 6 Bus Synchronizers. There is one
byte sync FIFO per gigabit receiver. Each byte sync FIFO will
accept data from each gigabit receiver independent of the mode of
the switch. The byte sync FIFO is about 256 words deep. This depth
is based on a derivation found in the Backplane Synchronizer ADS.
The Crossbar will handle the assignment of the appropriate input
byte lanes to the correct channels. Each Bus Synchronizer will
consist of four Channel FIFOs and one Bus Controller. The Bus
Controller can handle 4 separate OC48 channels or one OC192c
stream. The channel FIFO is about 16 words deep. The depth is based
on the number of words to read a 32 bit routeword. The whole
routeword is read and then presented to the rest of the Aggregator
in one cycle since it needs to be stored before the data of the
packet as it is constructed and sent to the memory controller.

Multiple gigabit receivers make up a 24-bit data bus and 2-bit
routeword bus for one channel of an Aggregator. Each gigabit
receiver can handle up to 8 bits. Due to varying transport delays
that can exist between receivers, bytes from different receivers
that belong to the same word can be skewed from each other. For
example, the 24-bit data bus and 2-bit routeword bus for one
channel of an aggregator will have 4 receivers that make up the
bus. The synchronization logic will align all 4 bytes for the 26
bit bus and will pass this byte aligned word to the rest of the
Aggregator. In order to align the bytes, the Striper will need to
send a special alignment byte to each receiver. A special K
character can be utilized from the gigabit transceivers. The K
character will be encoded in the data bits on the Gigabit
transmitter and will be detected on the Gigabit receiver.

The receive synchronizer in the Aggregator will consist of 24 FIFOs
where there is one FIFO per Gigabit Receiver. These FIFOs will
handle both byte alignment and the backplane synchronization. It is
assumed that the Gigabit Receivers will be able to distinguish
between valid, idle, sync and lock-down cycles and will indicate
these various cycles to the Aggregator by using 3 control signals.

On startup, the FIFOs will be empty and each Write State Machine
(WSM) will wait until a sync character is seen on its input. From
this point on, every cycle will be pushed except for lock-down
cycles from the fabric. When the fabric is locking down, the
Stripers will send special lock-down characters. This is done to
avoid overflowing the sync FIFOs in case the write side clock is
faster than the read side clock. While particular types of words
are being pushed, the word type will also be written to the FIFO so
it can be distinguished on the read side.
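
The write-side behaviour described above can be sketched as a small filter. This is an illustrative model only, assuming each incoming cycle is presented as a (word type, data) pair; the WordType names and the write_side function are hypothetical.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class WordType(Enum):
    VALID = "valid"
    IDLE = "idle"
    SYNC = "sync"
    LOCK_DOWN = "lock_down"


Word = Tuple[WordType, int]


def write_side(cycles: Iterable[Word]) -> List[Word]:
    """Model one write state machine feeding one byte sync FIFO."""
    fifo: List[Word] = []
    synced = False
    for word_type, data in cycles:
        if not synced:
            # Discard everything until the first sync character arrives.
            if word_type is not WordType.SYNC:
                continue
            synced = True
        if word_type is WordType.LOCK_DOWN:
            # Lock-down characters are never written into the FIFO.
            continue
        # The word type is stored with the data for the read side.
        fifo.append((word_type, data))
    return fifo
```

The sync character itself is stored, since the read side is described as checking that the first word in each byte sync FIFO is a sync character.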

The WSM is also looking for a special fabric resync cell K
character that will indicate that a fabric queue resync cell will
immediately follow. If a resync cell is detected, a resync signal
is passed along to the Bus Controller. The Bus Controller will then
tell other Aggregators on the fabric to resync their queues at the
next transition of the sequence counter. Fabric queue resync is
described in more detail later.

Gigabit receivers are not dedicated to particular input channels,
but instead shared between various channels. Each byte sync FIFO
works independently of the switch mode and each input lane needs to
be steered to the correct channel FIFO. For instance in 40 mode, 26
bits of data and routeword are required for Bus 1, channel A and
therefore 4 byte lanes are required to be steered to each channel
of Bus 1. In 80/120 mode, only 8 bits of data and 2 bits of
routeword are required and therefore two bytes will suffice. In 480
mode, only 4 bits are required per channel and one byte lane will
suffice. As switch capacity increases, fewer and fewer byte lanes
will be required for a particular channel. For all switch modes,
the routeword bits for a particular channel will always come from
the same byte lane. As the byte lanes get reduced from 4 to 2 to 1
byte lanes, there will always be one common byte lane used to carry
the routeword data lines. The crossbar will take in 24 lanes
consisting of 8 bits of data and 3 bits of control along with other
control signals to communicate with the Bus Control logic. It will
then forward all these signals to the appropriate channels. The
Crossbar will also accept control data from the Bus Controller and
forward signals such as read requests and FIFO flush signals to the
appropriate input byte sync FIFOs. Each crossbar

mapping between input byte lanes and channels is bi-directional.

The Bus Controller consists of three state machines. The state
machines control the read side of the byte sync FIFOs, the write
side of the channel FIFOs and the read side of the Channel FIFOs.
On the read side of the Byte FIFOs, pops will not commence until a
sync pulse has arrived and the receive sequence counter has
transitioned from "0" to "1". A signal will be provided from the
sequence generator block that indicates a "0" to "1" transition at
precisely this moment (sync event). At this time, the Bus
Controller issues a read to the Crossbar for the particular
channel. The Crossbar then forwards the read signal to the
appropriate byte sync FIFOs based on the mode of the switch. The
Crossbar then forwards all data and control from these byte sync
FIFOs back to the Bus Controller for this channel. The Bus
Controller checks the data types to make sure that the first word
in each of the appropriate byte sync FIFOs is a sync character. If
the first word of any of the appropriate byte lanes for this
channel is not a sync character, then a sync error will be flagged,
the appropriate byte sync FIFOs will be flushed and the
synchronization process will be re-initiated. If the first word is
a sync character, then pops will continue. In OC48 mode, this
process will be performed independently for each channel. OC192c
support is discussed later on.

Once data starts being read from byte sync FIFOs, the Bus
Controller will ignore data until it finds the first idle word.
Once an idle word has been found, it can now start looking for the
SOF indication in the routeword when the next non-idle word is
read. The rest of the routeword is processed and made available to
the rest of the Aggregator. If the stop bit in the routeword
indicates that the packet is continuing, then data will be
continuously made available to the Aggregator until a stop
indication is read. Note that even though a SOF is seen, it does
not mean that this segment is the first segment of a packet. It can
be any segment of a packet. Even though the segment may not be the
first one of a packet, it is allowed to go through the switch and
will be dropped later on.

When a sync character is read, a counter is initialized. The
counter counts each read from the byte sync FIFOs. The Bus
Controller will expect to see a sync character every sync pulse
period (about 22,000 cycles). If a sync character is read too early
or too late, then a sync error is flagged and data is dropped at
the precise logical cycle where a sync character is expected. A
packet that is being processed at the theoretical logical cycle for
sync will be terminated and inputs will be disabled until
re-enabled by S/W. For example, if after the first sync character,
the next sync character occurs at cycle 19,000, then a sync error
is flagged. Data is not dropped until 22,000 reads have been
performed. Also, if after the first sync character, the next sync
character is not received at all after 22,000 cycles, then a sync
error is flagged and data is dropped at this precise logical cycle.
If a sync character is received precisely 22,000 cycles after the
last one, then reads from the byte sync FIFOs are stopped until the
receive sequence counter transitions from "0" to "1". Waiting for
the "0" to "1" transition will ensure that all fabrics are
receiving the same stripe of a packet on the same logical cycle.
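
The three outcomes in the paragraph above can be illustrated with a small classifier. It is a sketch only, assuming the nominal 22,000-read period quoted in the text and treating the read count at which the next sync character appears as the single input; the function name and return labels are hypothetical.

```python
from typing import Optional, Tuple

SYNC_PERIOD = 22_000  # reads between sync characters, as quoted in the text


def classify_sync(next_sync_read: Optional[int],
                  period: int = SYNC_PERIOD) -> Tuple[str, Optional[int]]:
    """Return (status, drop_cycle) for the next sync character.

    next_sync_read is the read count at which the next sync character
    was seen, or None if none was seen within the period.
    """
    if next_sync_read == period:
        # On time: reads pause until the "0" to "1" counter transition.
        return ("ok", None)
    if next_sync_read is not None and next_sync_read < period:
        # Early sync: error flagged, data dropped at the expected cycle.
        return ("sync_error_early", period)
    # Late or missing sync: error flagged at the expected cycle.
    return ("sync_error_late_or_missing", period)


if __name__ == "__main__":
    print(classify_sync(19_000))
    print(classify_sync(None))
    print(classify_sync(22_000))
```

In both error cases the drop point is the expected cycle rather than the observed one, which is what lets every fabric drop at the same logical cycle.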

For OC192c, 4 input channels need to be concatenated into one
OC192c stream. In this mode, the Bus Controller will control all 4
channel FIFOs and the appropriate byte sync FIFOs. Data type
checking will be performed across 4 times as many byte lanes as in
the OC48 case. When it is time to read byte sync FIFOs, the Bus
Controller will control 4 read control lines to the Crossbar. The
Crossbar will initiate reads across all appropriate byte sync FIFOs
that are required for OC192c and will present data back to the Bus
Controller. The Bus Controller will check data types and will look
for SOF indications. The SOF indication and stop bits will only be
found in the Routeword for channel A. The Bus Controller will write
all 4 channel FIFOs at the same time when writing data and will
present the complete OC192c Routeword in one cycle to the rest of
the Aggregator. The functions of the Bus Controller will be
identical for OC48 and OC192c except that all 4 channel FIFOs will
be controlled when in OC192c mode.

Special cases can be broken down into the following categories:

1. Port card insertion
2. Port card removal
3. Port card errors including:
   A. No sync character
   B. Port card not locking down
   C. Routeword parity errors
   D. Garbage data
   E. Port card sending data too fast or too slow
4. Fabric Queue resync
5. Non-synchronized updates to Gigabit network

When a port card is inserted, the port card present signal will be
asserted and sent to each fabric. Not until S/W enables the
particular inputs and the Aggregator sees the port card present
signal will the Aggregator be ready to accept data from the new
port card. Once enabled, the Aggregator will go through the process
of looking for sync characters on individual byte lanes associated
with the new port card. It is assumed that the port card will not
send any data until it has been configured, and only after the
fabrics have been initialized. Once the port cards are enabled,
they will start sending sync characters periodically at every
global sync pulse arrival. It is important that all the appropriate
fabrics see the sync character from the particular port card since
some fabrics will be initialized later than others. After sync
characters have been received, all data will be written on each
cycle excluding lock-down characters.

When a port card is about to be removed, the enable switch on the
port card will be turned off. This will signal the port card to
finish sending valid packets and then send idles. The port card
will send a packet abort k character to indicate that no more valid
packets will be sent immediately following the last valid packet.
It is assumed that when the port card is actually removed, it will
have already sent the packet abort k character. This is critical
for the fabrics to keep their queues in sync. It is important that
each Aggregator on each fabric that handles the particular port
card stops forwarding data to the memory controllers at precisely
the same logical cycle. The WSM will stop writing data into the
byte sync FIFOs once the packet abort character is seen. The Bus
Controller will terminate the packet once the packet abort
character is read out of the byte sync FIFOs.

Case A: No sync/early sync/late sync from port card.
Solution: The Synchronizer will look for a sync at precisely the
same logical cycle each time. This will occur every sync pulse
period, which is approximately 22,000 125 MHz cycles. If the sync
character is not present at the head of the byte sync FIFOs when
22,000 cycles have been read since the last sync character, a sync
error will be flagged and data will be dropped at the cycle where
the sync character should have been. All fabrics need to drop data
at precisely the same logical cycle for this particular input lane.
Inputs for this particular channel will be turned off and the byte
sync FIFOs used for this channel will be flushed. S/W will turn off
the offending Striper. Inputs will be ignored until S/W enables
these inputs again. If a sync character arrives too early, then
data should be dropped at precisely the cycle where the early sync
was read. Other Aggregators will make the same drop decision if
this error is common to all fabrics. If the sync character arrives
too late or not at all, then the drop decision will be made where
the sync character was expected. The sync character is expected to
arrive every 22,000 cycles after the last sync.

Case B: Port card not locking down.
Solution: If the port card does not lock down, it will then send
more than the ideal number of valid and idle cycles between sync
characters. This will be caught by the same logic that checks for
sync characters in the correct logical cycles. Data will be dropped
the same way as in the case where no sync came from the port card.



Case C: Routeword parity errors.
Solution: If a parity error is detected for a particular routeword,
the packet will be terminated at the bad segment and a parity error
will be flagged. Data will be dropped after this terminated segment
is forwarded to the rest of the Aggregator and FIFOs for this
particular channel will be flushed. Inputs will be disabled until
re-enabled by S/W.

Case D: Garbage data from port card while all fabrics already in
sync.
Solution: If the data is unrecognizable by the gigabit receivers,
errors will be formed and provided to the Aggregator by the gigabit
receivers. At the point of error, data being written into byte sync
FIFOs will be flagged to be in error. If the Bus Controller sees
that the particular byte lane in error is not used for the
Routeword bits, then the error will be flagged but the data will be
passed on to downstream logic. This is considered to be a soft
failure since queues will still be able to stay in sync. If the Bus
Controller sees that the particular byte lane in error is used for
the Routeword bits, then the packet will be terminated and then
dropped once the erred word is read from the byte sync FIFO. The
input will be disabled, a gigabit receiver error will be flagged to
S/W and byte sync and channel FIFOs associated with this channel
will be flushed. This is considered to be a hard failure. If the
failure occurs only for one fabric, then other fabrics can still be
used to re-assemble the packets. S/W will have to queue resync the
bad fabric. If this error occurs across multiple fabrics, not much
can be done to prevent fabric queues from being corrupted. S/W will
then have to queue resync all fabrics.



Case E: Port card sending data too fast or too slow. It is possible
that the port card is sending the correct number of valid cycles
between sync characters but is not locking down enough or locking
down too much during each lock-down period. Byte sync FIFOs can
eventually overflow or underflow respectively. If more than one
fabric has FIFOs that overflow or underflow and data is dropped at
different logical cycles for the same source, then fabric queues
can become out of sync.
Solution: This is considered a hard failure since it should not
occur if the hardware is working correctly. The only way to
possibly prevent this is to flag an error if the FIFOs reach an
almost full or almost empty threshold. This is a warning sign that
something is wrong. S/W will then turn off the offending port card.
Data will continue to be written to and read from the byte sync
FIFOs as if nothing is wrong. If the port card can be turned off
and idles be sent before byte sync FIFOs overflow, then there will
be no dropped data and fabric queues will stay in sync. If FIFOs
overflow or underflow for a particular channel, then a FIFO
overflow/underflow error will be flagged. The packet being
processed by the synchronizer at the time of error will be
terminated. All data will be dropped from this point on. Inputs for
this channel will be disabled until re-enabled by S/W. FIFOs for
this channel will be flushed.

Fabric queue resync is performed in order to resynchronize memory
controller queues. It is important that all fabrics are processing
the stripe of the same cell or packet at precisely the same logical
cycle and that all fabrics are acting together as one logical
fabric. Fabric queue resync starts at the Stripers. The Striper
will receive a queue resync cell from the control port. The Striper
will decode the queue resync cell and will back up traffic until
the next sequence counter tick is reached. At this point, it will
send a fabric queue resync K character immediately followed by the
queue resync cell. On the fabric, the WSM in the receive
synchronizer will receive the queue resync K character and notify
the Bus Controller in the receive synchronizer that a queue resync
cell is in the input FIFO and that the queue resync event should
occur at the next transition of the receive sequence counter. The
Bus Controller will then indicate to other Aggregators on the
fabric that a resync cell event will take place at the next
transition of the sequence counter. This indication is asserted
about 10 cycles before the receive sequence counter transitions.
This is done to allow enough time for other Aggregators to see this
assertion before their respective receive sequence counters
transition also. Once the sequence count transition occurs, the
Aggregators will signal to the memory controllers that a queue
resync event has occurred and that this event delimits old and new
data. All data sent before the sync event is considered old data
and all data sent after the sync event is considered new data. The
memory controllers synchronize their buffers accordingly. The
resync cell is eventually sent through the switch as a regular cell
and returned to the control port.

There can be times when the gigabit network is changing its
operating mode and the switch is changing from a 40/80 to an 80/120
mode for example. There is no guarantee that Gigabit Receivers will
be driven by Gigabit Transmitters during this time period.
Aggregators that expect good data from certain Gigabit Receivers
may not get good data. If the switch is increasing its mode, then a
previously unused FIFO will now be used. If this FIFO has garbage
data on its inputs, then syncs will not be received and this FIFO
will not be synced until the gigabit network is stable. Once the
Gigabit network is stable, idles and sync characters will be
transmitted by the port cards and the FIFOs will have enough time
to sync up. If the switch is decreasing its mode, then previously
used FIFOs will now be unused. The Aggregator will know the new
switch capacity and will eventually ignore these channel FIFOs.

The Unstriper needs to provide back-pressure to the Separators when
internal FIFOs in the Unstriper become near full. Each Separator
will expect 24 separate back-pressure signals coming from all the
port card channels it is connected to. The back-pressure signal is
considered to be asynchronous to all ASICs. It is required that all
relevant Separators receive back-pressure from a particular channel
in the Unstriper at precisely the same logical cycle. This is done
by having the Unstripers assert the back-pressure signal when their
receive sequence counter transitions. It is assumed that the
Unstriper's receive sequence counter is a delayed version of the
Striper's transmit sequence counter. Since the tick length is 250
cycles and the receive counter is delayed by 150 cycles relative to
the transmit counter, there exist 100 cycles of margin to transport
the back-pressure signal from the Unstriper to the Separator. The
Separator needs about 10 cycles before the transition of its
sequence counter to sample the back-pressure signal. This will give
the Separator enough time to provide back-pressure to the memory
controller before the counter transitions. This places a maximum
requirement on the propagation delay of the back-pressure signal.
The following requirements hold true:

Back-pressure propagation delay < counter tick length - receive
sync pulse delay - setup time of Separator's sample point

Back-pressure propagation delay < 250 - 150 - 10

Back-pressure propagation delay < 90 cycles @ 125 MHz or 720 ns

Assuming worst-case conditions, the expected worst case
propagation delay would be:

Back-pressure propagation delay = (Unstriper to Striper delay) +
(Striper to Aggregator delay) + Aggregator to Separator delay
Back-pressure propagation delay = 5 cycles (chip and board delay)
+ (5 + 62 cycles) (chip and port card to fabric delay of 500 ns) +
5 cycles (chip and board delay)
Back-pressure propagation delay = 77 cycles < 90 cycles

As can be seen from this estimate, the maximum
back-pressure propagation delay requirement is met.
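
The budget and the worst-case estimate above can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the text (250-cycle tick, 150-cycle receive delay, 10-cycle setup, 125 MHz clock, 500 ns port card to fabric delay); the variable names are illustrative.

```python
CLOCK_MHZ = 125                      # backplane clock frequency
NS_PER_CYCLE = 1000 / CLOCK_MHZ      # 8 ns per cycle

tick_length = 250                    # counter tick length, in cycles
rx_sync_delay = 150                  # receive sync pulse delay, in cycles
separator_setup = 10                 # Separator sample-point setup, in cycles

# Maximum allowed back-pressure propagation delay.
budget_cycles = tick_length - rx_sync_delay - separator_setup
budget_ns = budget_cycles * NS_PER_CYCLE

# Worst-case estimate: 500 ns is 62.5 cycles, which the text counts as 62.
port_card_to_fabric = int(500 / NS_PER_CYCLE)
estimate_cycles = 5 + (5 + port_card_to_fabric) + 5

print(budget_cycles, budget_ns, estimate_cycles)
```

The estimate leaves 13 cycles of slack against the 90-cycle limit.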

Assuming airi ttre relevant Separators receive thre 

back pressure — signal — before — the — transition — to — the — next — sequence 

15 count;. — then — rt — can be — synchronized to the next — transition of — the 
transmit sequence counter. — This will allow all relevant Oeparators 
to stop sending valid data at precisely the same logical cycle for 

one — complete — counter — tick — interval . This — ts — true — since — it — irs- 

assumed that — when — thre — transmit — sequence — counter — transitions , — the 

2 0 data that the Oeparators are sending are companion fragments of the 

same — packet . If back pressure — i-s — sampled — again — before — the — next 

counter transition, — then data will be stopped for another counter 

tick interval. This mechanism implies that back pressure can only 

be generated on a counter tick length granularity. 
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Since — there — rs — rre — direct — path — from — Unstriper — txr 

Separator, — the back pressure signals need to be re" '- routed from the 
Unstriper , — to the — Striper, — to the Aggregator — and finally to the 

Separator . In order to do this, — each Unstriper needs to send the 

5 back-pressure — signal — ^ — th^ — corresponding — Striper — on — that — port 

card. Wte — Striper — will — then — forward — W-te — back -pressure — signal 

through — t+re — backplane — gigabit — transceivers — onto — fe+re — Aggregator . 
The Aggregator will forward up to 24 separate back pressure signals 
to one Separator corresponding to G buses with 4 channels per bus, 

10 ¥he — back pressure — signal — will — always — trs^ — h±t — 9 — crf — the — gigabit 

transceivers . Wte — receive — synchronizer — block — irr — the Aggregator 

will — forward the correct back pressure — signal — f-or — the appropriate 

bus and channel to the Separator. Since the gigabit receivers are 

not dedicated to any particular bus and channel, — fe+re — synchr 

15 needs to select the correct gigabit receiver based on the switch 

configuration — j ust — like — arfe — does — ftrr — regular — data . Once — this — jrs- 

done, — b-irt — 9 — of the — gigabit — receiver — irs — forwarded on as — the back 

pressure — signal , Note — that — bit — 9 — irs — also — used — tor — receiving — k 

characters and can change when sending a k character. In order to 

2 0 avoid mistakenly — interpreting bit — 9 — of a — k character — srs — a — valid 

back pressure signal, — the synchronizer will only sample the back - 
pressure bit when valid data is received from the gigabit receiver. 
In the case where a k character is received, the synchronizer will
hold the back-pressure signal at its current value. There is still
a case where the Striper can be sending back to back idle
characters since there is nothing to send. If the Striper needs to
change the value of the back pressure signal in this case, then it
will send one of two k characters that change the back-pressure
value. The two k characters that will be used are a set and clear
of the back-pressure signal. If the synchronizer receives a
back pressure set or clear character, it will set or clear the
back-pressure signal respectively. If any other k character is
received, the current back-pressure signal is retained. If valid
data is received, bit 9 of the appropriate gigabit receiver is
sampled as the back pressure signal.
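
The sampling rule described above can be illustrated with a minimal
sketch. It is not the actual synchronizer logic: the class name, the
Word type and the k-character codes (K_BP_SET, K_BP_CLEAR, K_IDLE) are
hypothetical names chosen for illustration, since the text does not
give the real code values. The sketch assumes each received word is
already flagged as either valid data or a k character.

```python
from dataclasses import dataclass

# Hypothetical k-character codes; the real values are not given in the text.
K_BP_SET = "K_BP_SET"      # k character that sets back pressure
K_BP_CLEAR = "K_BP_CLEAR"  # k character that clears back pressure
K_IDLE = "K_IDLE"          # any other k character, e.g. an idle


@dataclass
class Word:
    is_k: bool     # True if the gigabit receiver flagged a k character
    value: object  # 10-bit data word (int) or one of the k codes above


class BackPressureSync:
    """Tracks the back-pressure signal for one selected gigabit receiver."""

    def __init__(self):
        self.back_pressure = False

    def receive(self, word):
        if word.is_k:
            if word.value == K_BP_SET:
                self.back_pressure = True
            elif word.value == K_BP_CLEAR:
                self.back_pressure = False
            # Any other k character: hold the current value.
        else:
            # Valid data: sample bit 9 as the back-pressure signal.
            self.back_pressure = bool((word.value >> 9) & 1)
        return self.back_pressure
```

Feeding a data word with bit 9 set raises the signal, a following
idle k character leaves it unchanged, and a clear k character lowers
it even when no data is being sent.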

Although the invention has been described in detail in 
the foregoing embodiments for the purpose of illustration, it is to 
be understood that such detail is solely for that purpose and that 
variations can be made therein by those skilled in the art without 
departing from the spirit and scope of the invention except as it 
may be described by the following claims. 



WHAT IS CLAIMED IS: 



1. A switch for switching packets in a network 
comprising: 

port cards which send packets to and receive packets from 
the network; and 

fabrics connected to the port cards for switching 
portions of the packets, each fabric having queues in which 
portions of packets are stored, each queue corresponding to one of 
the port cards, each fabric having a determining mechanism which 
determines which queue the portions of the packet should be placed 
in, the detecting mechanism dynamic to reflect changes in port card 
quantity without any change in connection data of the packets. 

2. A switch as described in Claim 1 wherein each fabric 
has a memory controller having the queues and the detecting 
mechanism. 

3. A switch as described in Claim 2 wherein the 
detecting mechanism includes an input lookup which identifies in 
which queue portions of the packet are placed. 

4. A switch as described in Claim 3 wherein the input 
lookup identifies more queues than are present in the switch. 

5. A switch as described in Claim 4 wherein the fabric 
identifies which queues correspond to which output ports from a 
first signal it receives from the network. 

6. A switch as described in Claim 5 wherein the input 
lookup has a 10-bit field. 

7. A switch as described in Claim 6 wherein the fabric 
receives a second signal which identifies which bits of the 10-bit 
field are to be used to identify the queue the portions of the 
packet are to be stored in. 

8. A switch as described in Claim 7 wherein the 10-bit 
field comprises bits 0-7 which identifies the output port to which 
the queue connects and bits 8 and 9 identifies a priority of the 
portions of the packet. 

9. A switch as described in Claim 8 wherein the second 
signal has a 2-bit field which indicate which 8 of the 10 bits of 
the input lookup are to be used to identify the queue the portions 
of the packet are to be stored in. 

10. A switch as described in Claim 9 wherein the 8 bits 
of the 10 bits can be either bits 0-5, 8 and 9 which are 4 
priorities on up to 64 output ports, or bits 0-6 and 8 which are 2 
priorities up to 128 output ports, or bits 0-7 which are 1 priority 
up to 256 output ports. 

11. A switch as described in Claim 10 wherein the fabric 
has an aggregator which receives portions of packets and connects 
to the memory controller, and a separator which connects to the 
memory controller and sends portions of the packets to the port 
cards. 

12. A switch as described in Claim 11 wherein the port 
card includes a striper which sends portions of packets as stripes 
to the aggregator of each fabric, and an unstriper which receives 
portions of packets as stripes from the separator of each fabric. 

13. A method for switching packets in a network 
comprising the steps of: 

receiving packets at port cards of a switch from the 
network; 

sending portions of the packets as stripes to a 
respective number of fabrics of the switch; 

storing the respective portions of packets in queues of 
the fabric corresponding to port cards the portions of the packets 
are to be sent to from the respective fabrics; 

sending the portions of packets as stripes to the port 
card; 

transmitting packets from the port card to the network; 

changing the number of port cards in the switch; 

receiving more packets at the port cards; 

sending portions of the more packets to the number of the 
fabrics after the number of the fabrics has changed; and 

storing the portions of the more packets in the queues 
corresponding to the port cards the portions of the packets are to 
be sent to without any change to connection data in the packets. 

14. A method as described in Claim 13 wherein the 
storing step includes the step of looking up in an input lookup, 
which identifies in which queue portions of the packets are placed, 
which queue the portions of the packets are to be placed. 

15. A method as described in Claim 14 including after 
the changing step, there is the step of receiving a first signal 
which identifies in which queues portions of the packets are to be 
placed. 

16. A method as described in Claim 15 including after 
the receiving the first signal step, there is the step of receiving 
a second signal which identifies which bits of a 10-bit field of 
the input lookup are to be used to identify the queue the portions 
of the packet are to be stored in. 

17. A method as described in Claim 16 wherein the 
receiving the second signal step includes the step of reviewing a 
2-bit field of the second signal which indicate which 8 of the 10 
bits of the input lookup are to be used to identify the queue the 
portions of the packets are to be stored in. 

18. A method as described in Claim 17 wherein each 
fabric has a memory controller having the queues and the sending 
portions of packets step includes the step of sending the stripes 
to an aggregator of each fabric which receives portions of packets 
and connects to the memory controller. 

19. A method as described in Claim 18 wherein the 
portions step includes the step of sending with a separator of the 
fabric which connects to the memory controller portions of the 
packets as stripes to the port cards. 

20. A method as described in Claim 19 wherein the 
sending portions step includes the step of sending with a striper 
portions of packets as stripes to the aggregator of each fabric. 

21. A method as described in Claim 20 wherein after the 
sending with the separator step, there is the step of receiving the 
stripes from the separator of each fabric at an unstriper of each 
port card. 



ABSTRACT OF THE DISCLOSURE 

DYNAMIC QUEUE UTILIZATION 

A switch for switching packets in a network. The switch 
includes port cards which send packets to and receive packets from 
the network. The switch includes fabrics connected to the port 
cards for switching portions of the packets. Each fabric has 
queues in which portions of packets are stored. Each queue 
corresponds to one of the port cards. Each fabric has a 
determining mechanism which determines which queue the portions of 
the packet should be placed in. The detecting mechanism is dynamic 
to reflect changes in port card quantity without any change in 
connection data of the packets. A method for switching packets in 
a network. The method includes the steps of receiving packets at 
port cards of a switch from the network. Then there is the step of 
sending portions of the packets as stripes to a respective number 
of fabrics of the switch. Next there is the step of storing the 
respective portions of packets in queues of the fabric 
corresponding to port cards the portions of the packets are to be 
sent to from the respective fabrics. Then there is the step of 
sending the portions of packets as stripes to the port card. Next 
there is the step of transmitting packets from the port card to the 
network. Then there is the step of changing the number of port 
cards in the switch. Next there is the step of receiving more 
packets at the port cards. Then there is the step of sending 
portions of the more packets to the number of the fabrics after the 
number of the fabrics has changed. Then there is the step of 
storing the portions of the more packets in the queues 
corresponding to the port cards the portions of the packets are to 
be sent to without any change to connection data in the packets. 



